I don’t know in what context you might be looking at this HTML, but hello, I’m Tim. I’m writing this preface so that no matter where you’re coming from, I can explain what you’re about to see.
This is a data analysis I completed on some survey data I collected from a Facebook group known as (Sandy) Alex G 666posting. For those of you who don’t know, Alex G is a prolific indie artist who has been putting out music since at least 2010. His popularity has only continued to grow with each album release, but part of what makes him such an interesting musician is his sheer output: including his leaked unreleased tracks, Alex has 200+ songs, which to me is impressive.
I joined the group back in Jan of 2019 as I got into Alex’s unreleased discography, and I have come to really enjoy the community that exists there. One day an individual, Zion, asked members how old they were when they started listening to Alex G and how old they were now. I asked if it was okay, collected that data, and made a figure for the group showing the distribution of ages, along with a few others. I then asked if it was okay to make a survey for the group so I could analyze more data; I got the okay, specifically from Isabelle, and began working. Many individuals helped with the survey, as you can see at the bottom, but the result was 38 columns of data from 211 individuals. Some questions are Alex G related, some are not, but I hope you find something interesting in this little project. I’ve been working on it on and off in my free time since late April. Either way, it’s a labor of love to give back to the great community I found online. The group is full of cool people and I’m glad to be a part of it (I also think this project helped me become a mod/admin, so that was nice).
At the time of me writing this, 666posting is at the cusp of reaching 3,000 members. That’s great, but we need to talk about sample size, because it means we only have ~7.03% of the group accounted for. Not only that, but as this is an optional survey, one must be aware of the type of people who would take the time to fill out said survey. This is not an unbiased slice of the group, but I do like to think it includes the core group of active individuals. So with that in mind, let’s get into the analysis.
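(That coverage figure is nothing fancy, just the response count over the group size; a quick sanity check using the 211 responses and ~3,000 members mentioned above:)

```r
# Rough survey coverage of the group, as a percentage
responses <- 211
members <- 3000
round(responses / members * 100, 2)  # ~7.03% of the group
```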
- tidyverse: Used for general exploratory analysis, primarily dplyr within it
- ztable: Used to make various tables
- UpSetR: Used to make UpSet plots
- wordcloud: Used to make word clouds (surprise!)
- rworldmap: The map of the world!
- mapproj: The map of… US?
- wesanderson: Wes Anderson color palettes
- maps: More maps
- viridis: Used for colorblind-friendly palettes
- rstatix: Stat tests
- EnvStats: Used to get the N on ggplots for groupings
- ggpubr: Arranging the plots
- spotifyr: Getting that Spotify data!
- scales: Percent axis
- reshape2: Data melting
- ggforce: Sina plots

set.seed(123): This was just to keep the word clouds consistent.
## Loading the data and prepping functions
#functions
#Order the unique values of x from least to most frequent (most to least if decrease = TRUE)
hi_low_fact <- function(x, decrease = F){
names(table(x)[order(table(x), decreasing = decrease)])
}
#Cross-tabulate two columns, style the table as a heatmap, and caption it with a chi-square p-value
auto_ztable <- function(df, col1, col2, title, simul_p = F, pval_digits = 3, marg = 0) {
temp_tab <- table(df[[col1]], df[[col2]])
temp_ztab <- ztable(temp_tab)
chi_result <- chisq.test(df[[col1]], df[[col2]], simulate.p.value = simul_p)
temp_ztab %>% makeHeatmap(margin = marg) %>%
print(caption = paste0(title," ; Chi-Square p-value of ", format.pval(chi_result$p.value, digits = pval_digits)))
}
#Convert a numeric vector to z-scores
to_zcore <- function(x){
(x - mean(x))/sd(x)
}
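A quick toy demonstration of what the two small helpers do, on made-up values rather than the survey data (the functions are repeated here so the snippet runs on its own):

```r
#Least-to-most-frequent level ordering (same definition as hi_low_fact above)
hi_low_fact <- function(x, decrease = F){
names(table(x)[order(table(x), decreasing = decrease)])
}
#z-scores: center at 0, scale to unit sd (same definition as to_zcore above)
to_zcore <- function(x){
(x - mean(x))/sd(x)
}

hi_low_fact(c("a", "b", "b", "b", "c", "c"))  # "a" "c" "b" -- rarest value first
z <- to_zcore(c(18, 21, 21, 25, 30))
round(mean(z), 10)  # 0 -- centered
round(sd(z), 10)    # 1 -- unit sd
```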
dat <- dat %>% mutate(
race = str_to_title(race),
orientation = str_to_title(orientation),
gender = factor(gender, hi_low_fact(gender, decrease = F)),
transgender = factor(transgender, levels = c("No", "Yes", "Prefer not to answer")),
polyamorous = ifelse(polyamorous == "Prefer not to say", "Prefer not to answer", polyamorous),
polyamorous = factor(polyamorous, levels = c("No", "Yes", "Prefer not to answer")),
generation = ifelse(generation == "", "Prefer not to say", generation)
)
So the first demographics we’ll take a quick peek at are age and race. First, we’ll look at age in a histogram colored on “what generation do you identify with”. Get used to the age variable: as it is one of our few numerical values, I’m going to be plotting it a lot. Anyways, overall age is right-skewed, with only one real outlier at 42. Alex G is about 26-27 at the time of me writing this and has been writing music since his early teenage years, so a demographic of teens and 20-somethings is unsurprising.
As a straight white dude (who, as we will see, is the average Alex G listener), I can’t really speak to why Alex is or isn’t popular with other demographics, and frankly I’m not sure it is my place to speculate. As a result, for race, gender, trans identity, and polyamory I will let you draw your own conclusions. I just don’t feel I, as an individual, should be speaking didactically about such a subject.
#dat <- read.csv("inputs/alex_g_stats_9_24.csv", stringsAsFactors = F)
#Age
age_plot <- ggplot(dat, aes(age, fill = generation)) +
geom_histogram(breaks=seq(15, 45, by=1), color = "black")+
ylab("Count") +
xlab("Age") +
theme_classic() +
scale_x_continuous(breaks = round(seq(min(dat$age, na.rm = T), max(dat$age, na.rm = T), by = 5)/5) * 5)
age_summary <- round(t(as.matrix(summary(dat$age))), digits = 2)
rownames(age_summary) <- "Age"
age_summ_table <- ggtexttable(age_summary)
#Race plot
race_dat <- dat %>% count(race)
race_plot <- ggplot(race_dat, aes(race, n)) +
geom_bar(stat = "identity") +
coord_flip() +
theme_classic() +
geom_text(aes(label=n), nudge_y = 2.5) +
ylab("Count") +
xlab("Race")
dem_plot <- ggarrange(age_plot, age_summ_table, race_plot, nrow = 3,
heights = c(1, .3, 1))
annotate_figure(dem_plot, top = text_grob(
"Age and Race of individuals in 666posting"
))
#gender plot
gender_plot <- ggplot(dat, aes(gender, fill = gender)) +
geom_bar() +
coord_flip() +
theme_classic() +
theme(legend.position="bottom") +
geom_text(stat='count', aes(label=..count..), nudge_y = 2.5)
#gender_plot
#trans identity plot
trans_dat <- filter(dat, transgender != "") %>% select(age, transgender)
trans_plot <- ggplot(trans_dat, aes(transgender)) + geom_bar() +
coord_flip()+
theme_classic() +
geom_text(stat='count', aes(label=..count..), nudge_y = 4)
#trans_plot
#orientation plot
orientation_totals <- dat %>% group_by(orientation) %>%
count(orientation)
orientation_plot <- ggplot(dat, aes(orientation, fill = gender)) +
geom_bar(position = "dodge") +
coord_flip() +
theme_classic() +
theme(legend.position="bottom")+
geom_text(stat='count',
aes(label=..count..),
position = position_dodge(width = 1),
hjust = -.08,
size = 4)
#geom_text(aes(orientation, n, label = n, fill = NULL), data = orientation_totals, nudge_y = 4)
#polyamorous identity plot
poly_totals <- dat %>% group_by(polyamorous) %>%
count(polyamorous)
poly_plot <- ggplot(dat, aes(polyamorous, fill = gender)) +
geom_bar(position = "dodge") +
coord_flip() +
theme(legend.position = "none") +
theme_classic() +
geom_text(stat='count', aes(label=..count..), position = position_dodge(width = 1), hjust = -.08)
# geom_text(aes(polyamorous, n, label = n, fill = NULL), data = poly_totals, nudge_y = 4)
#poly_plot
gpt_plot <- ggarrange(gender_plot, orientation_plot,
poly_plot, trans_plot,
nrow = 4, common.legend = T,
heights = c(.4, 1.5, 1, .8))
annotate_figure(gpt_plot, top = text_grob(
"Gender count, polyamory colored on gender, and trans identity"
))
This is the plot that inspired this project. How? Quick storytime:
Zion, a member of 666posting, asked individuals “What age were you when you discovered Alex G and what age are you now?”. Everyone started answering the question and I found it interesting. Eventually, over 200 individuals had responded, and I came up with the idea of making two overlapping histograms of people’s ages when learning about Alex G and their current ages. I asked for permission and then I began the first analysis of 666posting. As you’ll see, I didn’t stop there with that data, since I thought of other things to do with it, but this eventually led to me asking if I could run a survey, which led to this whole R Markdown. So really Zion inspired me to do this, so if you’re reading this, thanks dude!
## One of the data points I make sure to derive is the year people learned about Alex G, using the fact this survey was done in 2020 as a reference. This will be used later
##Two histograms
dat <- dat %>% mutate(difference = age - first_age, year_learned = 2020 - difference)
#melt_dat <- melt(dat %>% select(-difference, -year_learned))
melt_dat <- melt(dat %>% select(age, first_age))
#get mean
start_mean <- mean(dat$first_age)
start_median <- median(dat$first_age)
end_mean <- mean(dat$age)
end_median <- median(dat$age)
sum_dat <- NULL
sum_dat$mean <- c(end_mean, start_mean)
sum_dat$median <- c(end_median, start_median)
sum_dat$group <- factor(c("end", "start"), levels = c("start", "end"))
sum_dat <- as.data.frame(sum_dat)
# Interwoven histograms
ggplot(melt_dat, aes(x=value, fill=variable)) +
geom_histogram(position="identity", alpha=0.5, color = "black" ,
#bins = (max(melt_dat$value) - min(melt_dat$value)
binwidth = 1)+
scale_x_continuous(breaks=seq(min(melt_dat$value),max(melt_dat$value),2)) +
#geom_vline(data=sum_dat, aes(xintercept=mean, color=group), size = 1.5,
# linetype="dashed")+
#geom_vline(data=sum_dat, aes(xintercept=median, color=group), size = 1.5,
#linetype="dashed") +
geom_vline(xintercept = c(end_median, start_median), size = 1.8,
linetype="dashed", color = c("royalblue3","red3")) +
annotate(geom="text", vjust = 1.7, size = 3.5, x = c(35),
y = c(13.1, 16), label = c("Current Age Median", "Starting Age Median")) +
geom_segment(aes(x = 32, y = 15.5, xend = 38, yend = 15.5),
linetype="dashed", size = 1.5, color = "red3")+
geom_segment(aes(x = 32, y = 12.5, xend = 38, yend = 12.5),
linetype="dashed", size = 1.5, color = "royalblue3")+
labs(title = "Age of Finding Alex G and Current Age", subtitle = "Overlapping Histogram", y = "# of Individuals", x = "Age", legend = "") +
scale_fill_discrete(name = "",labels=c("Current Age", "Starting Listening Age")) +
theme_classic(base_size = 13) + theme(legend.position="top")
ggsave("start_age_current.png", width = 9, height = 7)
Another plot I made, using the aforementioned “year of learning about Alex G” data I extrapolated, is a diagram of his fan-base growth over time. I did this by taking the “first year people learned of Alex G” data and overlaying it with his album releases. However, unlike my initial version of this plot, based on people just commenting on a Facebook post, this data is a little… odd.
##Second histogram
fact_order <- as.character(sort(unique(dat$year_learned)))
##Add album years
race = 2010
winner = 2011
Rules_Trick = 2012
DSU = 2014
beach_music = 2015
rocket = 2017
HoS = 2019
dat_2 <- (dat %>% select(year_learned))
ggplot(dat_2, aes(x=year_learned)) +
geom_histogram(color = "black", fill = "grey", binwidth = 1)+
scale_x_continuous(breaks=seq(2009,2020,2)) +
geom_vline(xintercept = race, linetype="dotted",
color = "deeppink1", size=1.5) +
geom_vline(xintercept = winner,
color = "yellow3", size=1.5, linetype ="dotted") +
geom_vline(xintercept = Rules_Trick,
color = "deepskyblue1", size= 1.5, linetype ="dotted") +
geom_vline(xintercept = DSU,
color = "springgreen4", size= 1.5, linetype ="dotted") +
geom_vline(xintercept = beach_music,
color = "royalblue4", size= 1.5, linetype ="dotted") +
geom_vline(xintercept = rocket,
color = "red3", size= 1.5, linetype ="dotted") +
geom_vline(xintercept = HoS,
color = "darkorchid4", size= 1.5, linetype ="dotted") +
annotate(geom="text", angle = 90, vjust = 1.3, size = 5,
x=c(race, winner, Rules_Trick, DSU, beach_music, rocket, HoS),
y=c(10, 10, 10, 10, 10, 10, 10),
label=c("Race", "Winner", "Rules/Trick", "DSU", "Beach Music", "Rocket", "House of Sugar"),
#color="red"
color = c("deeppink", "yellow3", "deepskyblue1", "springgreen4", "royalblue4", "red3", "darkorchid4")
) +
labs(title = "Histogram of when people first heard of Alex G", x = "Year First Heard", y = "# of Individuals") +
theme_classic(base_size = 13)
ggsave("year_start_listening.png", width = 9, height = 7)
So below is the original figure.
And compared to that figure, our new one is definitely different. First of all, we have much more of a normal distribution, besides the person… who first heard Alex G in ’07. Frankly, I’m a little dubious and would be interested in hearing that story. The individual did not identify which state they live in; my worries would have been assuaged by hearing they lived in PA, but alas, I have no idea. The other interesting aspect is that our first non-’07 fan comes in around 2011, after Race, an arguably seminal Alex G album. I would have expected a larger boost with Race, and the previous plot I made did have individuals listening around the time of Race, admittedly few, but more than none. It is important to note Alex G was making music with his band the Skin Cells during the early years as well, but I’m less familiar with that history.
Also, compared to the previous plot we have a lot more 2018-20 fans, but I think that’s primarily because the original data was pulled in late March/early April 2020, while this survey has been open the entire year, so that is not unexpected. On the topic of the actual data: it is for the most part normal, which is more indicative of the sample we’re pulling from, as I would presume the counts should just continue increasing with each album, since his fame has only grown. Maybe there is a critical limit of indie fans that he might’ve reached, who knows! I personally believe this is also due to the fact that people trickle into these Facebook groups, so while more people might like Alex now, they might not be in the group (or willing to answer a silly survey). It is interesting that 2015 and the release of Beach Music mark his largest jump in popularity, as that is also the year he signed with Domino Records, who probably helped promote his work to a wider audience.
How long have we been listening? This long.
dat_3 <- dat %>% select(difference)
ggplot(dat_3, aes(x=difference)) +
geom_histogram(color = "black", fill = "grey", binwidth = 1) +
scale_x_continuous(breaks=seq(min(dat_3$difference),max(dat_3$difference),1)) +
# geom_vline(xintercept = mean(dat_3$difference)) +
labs(title = "How Many Years Have We Been Listening?", x ="Length of Time in Years", y = "# of Individuals") +
theme_classic(base_size = 17)
ggsave("how_long_listening.png", width = 9, height = 7)
It’s the previous plot inverted, what can I say.
I just like cumulative distribution plots; it’s all the same data. If you don’t know how cumulative distribution plots work: the line counts upward, year by year, toward the total, so you can see what percentage of the current fanbase had arrived by each year, up to the current 100%.
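If the idea is still fuzzy, here is a tiny made-up example (toy years, not the survey data) of what the ECDF is computing, using base R’s `ecdf()`:

```r
# Toy "year learned" values; ecdf() returns a step function giving
# the fraction of values at or below each point
years <- c(2014, 2015, 2015, 2017, 2020)
f <- ecdf(years)
f(2015)  # 0.6 -- 3 of the 5 toy fans had arrived by 2015
f(2020)  # 1 -- by 2020 the whole toy fanbase is accounted for
```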
ggplot(dat, aes(x=year_learned)) + stat_ecdf() + scale_x_continuous(breaks=seq(2009,2020,2))+
scale_y_continuous(labels = percent) +
geom_vline(xintercept = race, linetype="dotted",
color = "deeppink1", size=1.5) +
geom_vline(xintercept = winner,
color = "yellow3", size=1.5, linetype ="dotted") +
geom_vline(xintercept = Rules_Trick,
color = "deepskyblue1", size= 1.5, linetype ="dotted") +
geom_vline(xintercept = DSU,
color = "springgreen4", size= 1.5, linetype ="dotted") +
geom_vline(xintercept = beach_music,
color = "royalblue4", size= 1.5, linetype ="dotted") +
geom_vline(xintercept = rocket,
color = "red3", size= 1.5, linetype ="dotted") +
geom_vline(xintercept = HoS,
color = "darkorchid4", size= 1.5, linetype ="dotted") +
annotate(geom="text", angle = 90, vjust = 1.3, size = 5,
x=c(race, winner, Rules_Trick, DSU, beach_music, rocket, HoS),
y=c(.10, .20, .30, .50, .80, .60, .75),
label=c("Race", "Winner", "Rules/Trick", "DSU", "Beach Music", "Rocket", "House of Sugar"),
#color="red"
color = c("deeppink", "yellow3", "deepskyblue1", "springgreen4", "royalblue4", "red3", "darkorchid4")
) +
labs(title = "How has Alex G's Fanbase Grown?",subtitle = "Cumulative Distribution Plot of Alex G Fanbase Growth", x ="Year", y = "Percent of Current Fanbase") +
theme_classic(base_size = 13)
ggsave("alex_g_fanbase_growth.png", width = 9, height = 7)
So Alex G, for those not in the know, lives in Philly. But, as with most successful artists, people want to see him perform in other cities, and he tours pretty regularly. So I thought it would be interesting to see where Alex G fans are all over the world!
country_codes <- read.csv("inputs/countrycode.csv", stringsAsFactors = F)
#Get a count of fans per each country
world_dat <- as.data.frame(table(dat$country), stringsAsFactors = F)
colnames(world_dat) <- c("Country.Name", "fans")
#Join with official data
world_dat <- left_join(country_codes, world_dat)
world_needed <- world_dat %>% select(fans, ISO2, ISO3, Country.Name)
#colors
pal <- wes_palette("Zissou1", 100, type = "continuous")
#Get a higher resolution
worldmap <- getMap(resolution = "low")
worldmap@data <- left_join(worldmap@data, world_needed)
worldmap_poly <- fortify(worldmap)
worldmap_poly <- merge(worldmap_poly, worldmap@data, by.x="id", by.y="ADMIN", all.x=T)
#Make fans per 100,000 (zero out the NA fan counts first so countries with no respondents get 0, not NA)
worldmap_poly <- worldmap_poly %>% mutate(fans = ifelse(is.na(fans), 0, fans),
fans_per = round((fans/POP_EST * 100000), digits = 3))
#Remove the US because they have too many individuals
worldmap_poly_nUSA <- worldmap_poly %>% filter(Country.Name != "United States")
A simple plot of the whole world colored on fans per 100,000 individuals in each of those countries.
#Whole world with US
ggplot() +
coord_map(xlim = c(-180, 180), ylim = c(-60, 75)) +
geom_polygon(data = worldmap_poly, aes(long, lat, group = group, fill = fans_per),size = 0.3) +
geom_text(data=subset(worldmap@data, !is.na(SOVEREIGNT)), aes(x=LON, y=LAT,label=fans), size=1.5) +
scale_fill_gradientn(colours = c("cyan3", "gold", "orangered")) +
theme_void() +
labs(fill = "Fans per
100,000 indv")
“HOLD UP, WHERE IS ALL THE RED?” you may say, to which I respond, “GIVE ME A SECOND, I’LL GET TO IT, WE GOTTA ZOOM IN EVERYWHERE FIRST TO FIND IT.”
ggplot() +
coord_map(xlim = c(-180, 180), ylim = c(-60, 75)) +
geom_polygon(data = worldmap_poly, aes(long, lat, group = group, fill = fans_per),size = 0.3) +
geom_text(data=subset(worldmap@data, !is.na(SOVEREIGNT)), aes(x=LON, y=LAT,label=fans), size=5) +
coord_cartesian(xlim = c(-150, -50), ylim = c(15, 60)) +
scale_fill_gradientn(colours = c("cyan3", "gold", "orangered")) + theme_void( base_size = 14) +
labs(title = "Alex G Fans in North America",
fill = "Fans per
100,000 indv")
ggsave("outputs/north_america.png")
###South america
ggplot() +
coord_map(xlim = c(-180, 180), ylim = c(-60, 75)) +
geom_polygon(data = worldmap_poly_nUSA, aes(long, lat, group = group, fill = fans_per), color = "black" ,size = 0.3) +
geom_text(data=subset(worldmap@data, !is.na(SOVEREIGNT)), aes(x=LON, y=LAT,label=fans), size=5)+
coord_cartesian(xlim = c(-90, -30), ylim = c(-60, 15)) +
scale_fill_gradientn(colours = c("cyan3", "gold", "orangered")) + theme_void( base_size = 14) +
labs(title = "Alex G Fans in South America",
fill = "Fans per
100,000 indv")
ggsave("outputs/south_america.png", height = 8, width = 7)
###Oceania and Southeast Asia
ggplot() +
coord_map(xlim = c(-180, 180), ylim = c(-60, 75)) +
geom_polygon(data = worldmap_poly_nUSA, aes(long, lat, group = group, fill = fans_per), color = "black",size = 0.3) +
geom_text(data=subset(worldmap@data, !is.na(SOVEREIGNT)), aes(x=LON, y=LAT,label=fans), size=5)+
coord_cartesian(xlim = c(100, 180), ylim = c(-50, 45)) +
scale_fill_gradientn(colours = c("cyan3", "gold", "orangered")) + theme_void( base_size = 14) +
labs(title = "Alex G Fans in East Asia and Oceania",
fill = "Fans per
100,000 indv")
ggsave("outputs/south_asia_aus.png", height = 8, width = 7)
###Europe
ggplot() +
coord_map(xlim = c(-180, 180), ylim = c(-60, 75)) +
geom_polygon(data = worldmap_poly_nUSA, aes(long, lat, group = group, fill = fans_per), color = "black",size = 0.3) +
geom_text(data=subset(worldmap@data, !is.na(SOVEREIGNT)), aes(x=LON, y=LAT,label=(fans)), size=5)+
coord_cartesian(xlim = c(-10, 30), ylim = c(30, 70)) +
scale_fill_gradientn(colours = c("cyan3", "gold", "orangered")) + theme_void( base_size = 14) +
labs(title = "Alex G Fans in Europe",
fill = "Fans per
100,000 indv") +
geom_text(aes(x = 18.3754, y = 34.3), label = "Malta is where all the
red on this map is")
ggsave("outputs/europe_current.png", height = 8, width = 7)
###Malta zoom
ggplot() +
coord_map(xlim = c(-180, 180), ylim = c(-60, 75)) +
geom_polygon(data = worldmap_poly_nUSA, aes(long, lat, group = group, fill = fans_per), color = "black",size = 0.3) +
geom_text(data=subset(worldmap@data, !is.na(SOVEREIGNT)), aes(x=LON, y=LAT,label=(fans)), size=5)+
coord_cartesian(xlim = c(12, 16), ylim = c(35, 37)) +
scale_fill_gradientn(colours = c("cyan3", "gold", "orangered")) + theme_void( base_size = 14) +
labs(title = "The Alex G Malta Fan",
fill = "Fans per
100,000 indv") +
geom_text(aes(x = c(15.10, 14), y = c(35.8, 37)), label = c("Behold, Malta,
population of ~500,000", "This is Italy, in case you're garbage with geography like me"))
ggsave("outputs/malta.png", height = 8, width = 7) #distinct filename so the Europe map above isn't overwritten
So with all that said, the only interesting thing I could think of to compare this data to is which countries Alex G most frequently tours in, so… tada, I took this from setlist.fm. They actually have interesting concert statistics, so I suggest checking them out. I actually had a project idea where I was gonna scrape their data, make an average Alex G setlist for each year, and build a naive Bayes model/Shiny app allowing people to put in their perfect setlist and get back the probability of that concert happening. But they already did the first thing, and I ended up doing this entire analysis instead of the naive Bayes thing. That’s life. Right, TADA:
Alex G Concerts Around the World from setlist.fm
So I put that data into the following csv file I load in. I want to see if the number of fans correlates with how many concerts Alex has played in each country. But you can see what I saw when I first plotted the data….
concerts_world <- read.csv("inputs/countries_concerts.csv", fileEncoding = "UTF-8-BOM")
joined_country_concerts <- left_join(concerts_world, world_dat)
ggplot(joined_country_concerts, aes(y = concerts, x = fans, label = Country.Name)) +
geom_point() +
labs(title = "Thanks America, guess we'll log scale") +
theme_classic()
America ruined the graph. So I decided to log scale it and make it more informative with shapes and colors.
#We keepin Ireland
#joined_country_concerts <- joined_country_concerts %>% filter(!is.na(joined_country_concerts$fans))
shape_levels <- rev(1:nlevels(as.factor(joined_country_concerts$Country.Name)))
ggplot(joined_country_concerts, aes(y = concerts,
x = fans,
shape = Country.Name,
color = Country.Name)) +
geom_point(size = 3) +
scale_x_continuous(breaks=c(0,1,2,3,4,5,10,30,100,300), trans="log1p") +
scale_y_continuous(breaks=c(0,1,2,3,4,5,10,30,100,300), trans="log1p") +
scale_shape_manual(values=shape_levels) +
labs(title = "Alex G Concerts compared to number of fans, log(1 + x) scale") +
theme_classic()
Ireland had 2 concerts and no fans in the survey, which is why you can’t see it :(
So now let us look at the Alex G Fans around the US.
I used the vignette below to guide this section: https://cran.r-project.org/web/packages/usmap/vignettes/mapping.html
us_states <- map_data("state")
#head(us_states)
#Filter state data
state_dat <- dat %>% filter(!(state %in% c("I don't live in the states", ""))) %>% mutate(region = tolower(state))
state_df <- data.frame(table(state_dat$region),
stringsAsFactors = F) %>%
select(region = Var1, count = Freq)
us_states_2 <- left_join(us_states, state_df)
I got the census data from census.gov, specifically the file named NST-EST2019-01: “Table 1. Annual Estimates of the Resident Population for the United States, Regions, States, and Puerto Rico: April 1, 2010 to July 1, 2019”. I’m not gonna lie, I was lazy and cleaned up the data in Excel; basically, I kept only the 2019 data, as it was most relevant. I used this data primarily to control for population in the coloration of the maps below. I don’t really have much to say on them, so enjoy!
#######
census <- read.csv("inputs/state_census.csv", stringsAsFactors = F) %>%
mutate(region = tolower(str_replace(region, pattern = "[.]", replacement = ""))) %>% select(region, X2019)
census <- census %>% mutate(X2019= round(as.numeric(str_replace_all(X2019, ",", ""))))
#100000
us_states_3 <- left_join(us_states_2, census) %>% mutate(count = ifelse(is.na(count), 0, count),
adjust = round((count/X2019 * 100000), digits = 3))
## This data I'm gonna use to center text in the states, it's basically
## each state’s central longitude and latitude.
center_state <- read.csv(file = "inputs/statelatlong_2.csv", stringsAsFactors = F) %>% rename(region = name, lat_center = latitude, long_center = longitude) %>%
mutate(region = tolower(region))
us_states_3 <- left_join(us_states_3, center_state)
p <- ggplot(data = us_states_3,
mapping = aes(x = long, y = lat,
group = group, fill = adjust), color = "black")
#All of US
p + geom_polygon(color = "gray90", size = 0.1) +
coord_map(projection = "albers", lat0 = 39, lat1 = 45) +
geom_text(aes(long_center, lat_center, label = count), size = 3.2) +
scale_fill_gradientn(colours = c("cyan3", "gold", "orangered")) +
theme_void() +
labs(title = " Alex G Fans in the USA",fill = "Fans per
100,000 indv")
ggsave("outputs/all_us.png", height = 6, width = 8)
#North East
p + geom_polygon(color = "gray90", size = 0.1) +
coord_map(xlim = c(-80, -65), ylim = c(38,48)) +
geom_text(aes(long_center, lat_center, label = count), size = 3.3) +
scale_fill_gradientn(colours = c("cyan3", "gold", "orangered")) +
theme_void() +
labs(title = " Alex G Fans in the North East",fill = "Fans per
100,000 indv")
ggsave("outputs/NE_us.png", height = 8, width = 8)
#The south
p + geom_polygon(color = "gray90", size = 0.1) +
coord_map(xlim = c(-110, -65), ylim = c(25,40)) +
geom_text(aes(long_center, lat_center, label = count), size = 3.3) +
scale_fill_gradientn(colours = c("cyan3", "gold", "orangered")) +
theme_void() +
labs(title = " Alex G Fans in the South",fill = "Fans per
100,000 indv")
ggsave("outputs/south_us.png", height = 8, width = 12)
#The Mid West
p + geom_polygon(color = "gray90", size = 0.1) +
coord_map(xlim = c(-114, -80), ylim = c(32,49)) +
geom_text(aes(long_center, lat_center, label = count)) +
scale_fill_gradientn(colours = c("cyan3", "gold", "orangered")) +
theme_void() +
labs(title = " Alex G Fans in the Mid West",fill = "Fans per
100,000 indv")
ggsave("outputs/midwest_us.png", height = 8, width = 9)
#The west
p + geom_polygon(color = "gray90", size = 0.1) +
coord_map(xlim = c(-125, -110), ylim = c(30,49)) +
geom_text(aes(long_center, lat_center, label = count)) +
scale_fill_gradientn(colours = c("cyan3", "gold", "orangered")) +
theme_void() +
labs(title = " Alex G Fans in the West",fill = "Fans per
100,000 indv")
ggsave("outputs/west_us.png", height = 9, width = 7)
So the analysis I decided to do here asks “Where is Alex’s fan base the largest (when controlling for population size) in the United States?” I pulled this region data from Kaggle and will use it to test our hypothesis. “Our hypothesis?” you say? Yes: obviously the East Coast has the most representation, and if it doesn’t, all hope will be lost.
(P.S. I’ve learned of the state.region data in R, but womp womp here we R, let’s just go with it)
state_region <- read.csv("inputs/states.csv", stringsAsFactors = FALSE)
us_state_region <- us_states_3 %>%
select(State = region, count, adjusted_fans = adjust, State.Code = state, pop = X2019) %>%
mutate(State = str_to_title(State)) %>% unique()
us_state_region <- left_join(us_state_region, state_region) %>%
mutate(Region = case_when(
State == "District Of Columbia" ~ "South",
TRUE ~ Region
), Division = case_when(
State == "District Of Columbia" ~ "South Atlantic",
TRUE ~ Division
))
summarised_region <- us_state_region %>%
group_by(Region) %>%
summarise(adjusted_fans = sum(adjusted_fans), fans = sum(count), pop = sum(pop)) %>%
mutate(proportion = fans/pop)
ggplot(summarised_region, aes(Region, adjusted_fans, fill = Region)) +
geom_bar(stat = "identity") +
labs(title = "Adjusted Fans Per Region",
subtitle = paste("I mean... I think the east coast still wins when you add its sections together?",
"N = ", nrow(state_dat))) +
ylab("Adjusted Fans") +
theme_classic()
prop.test(x = summarised_region$fans, n = summarised_region$pop)
##
## 4-sample test for equality of proportions without continuity
## correction
##
## data: summarised_region$fans out of summarised_region$pop
## X-squared = 3.4198, df = 3, p-value = 0.3313
## alternative hypothesis: two.sided
## sample estimates:
## prop 1 prop 2 prop 3 prop 4
## 4.390522e-07 5.001536e-07 3.264839e-07 4.199483e-07
Well there doesn’t seem to be a clear winner… I’ll just tell myself if we included Southern East Coast states it would’ve won.
So, one of the first graphs I generated with the survey, because I thought it would be a fun one to do, looks at the political leanings in the group. While politics has a lot more depth than can be represented by my figure, I think having two axes, left to right and libertarian to authoritarian, allows a fair amount of depth in itself. The question was phrased as follows:
“Where do you stand on the political compass? This website will calculate it, but you can do it by personal feel as well by looking at the image! https://www.politicalcompass.org/ (center at 5 and round if you get a score from the website, site ~10 minutes)”
The below image serves as a guide. The image isn’t completely accurate, if you ask me, but from a quick Google search, it does a good enough job.
The reason that I said this is spicy is because, let’s be honest, we live in an incredibly politically charged time. And when I posted this, I completely understood people not being comfortable with individuals in the group having, uh, “strong fascist leanings” so it caused a bit of a row (Never a good look when your post has more comments than likes). That said, I still think the data is pretty interesting so let’s look at it.
## Cleaning the dataset and getting medians and means for each leaning
#I remove one fill in the blank because... yeah
political_dat <- select(dat,left_right, lib_auth, gender, age) %>% drop_na()
#%>% mutate(gender = ifelse(gender =="There are only 2 genders", "Prefer not to say", gender))
#Get the medians and means of the compass
med_LR <- median(political_dat$left_right)
med_LA <- median(political_dat$lib_auth)
mean_LR <- mean(political_dat$left_right)
mean_LA <- mean(political_dat$lib_auth)
As the data is technically discrete and ordinal (I asked for answers in integers), this is one of the better ways to represent it. The size of each circle represents the number of individuals who selected that option. We will have more readable representations of this data later on, in case you’re interested in exact numbers.
#Temporarily add a point at (9, 8) so that combination exists in the table; the extra count is subtracted back out below
political_dat_2 <- rbind(political_dat %>% select(left_right, lib_auth), c(9, 8))
pol_count <- as.data.frame(table(political_dat_2$left_right, political_dat_2$lib_auth))
pol_count <- pol_count %>% mutate(Freq = ifelse((Var1 == "9" & Var2 == "8"), Freq - 1, Freq))
colnames(pol_count)[1:2] <- c("left_right", "lib_auth")
pol_count$left_right <- as.integer(as.character(pol_count$left_right))
pol_count$lib_auth <- as.integer(as.character(pol_count$lib_auth))
pol_count <- filter(pol_count, Freq != 0)
ggplot(pol_count) +
#
geom_rect(aes(xmin = 5, xmax= 10,ymin = 0, ymax=5), fill = "yellow", alpha = 0.02) +
geom_rect(aes(xmin = 0, xmax= 5,ymin = 0, ymax=5), fill = "green", alpha = 0.02) +
geom_rect(aes(xmin = 5, xmax= 10,ymin = 5, ymax=10), fill = "blue", alpha = 0.02) +
geom_rect(aes(xmin = 0, xmax= 5,ymin = 5, ymax=10), fill = "red", alpha = 0.02) +
geom_point(aes(x = left_right, y = lib_auth, size= Freq)) +
#Adding median and mean
geom_point(aes(x=med_LR, y=med_LA), colour="white", shape = 9, size =3) +
geom_point(aes(x=mean_LR, y=mean_LA), colour="red", shape = 9, size =3) +
#Add the dotplot annotation
geom_point(aes(x=8, y=3), colour="white", shape = 9, size =4) +
geom_point(aes(x=8, y=1.5), colour="red", shape = 9, size =4) +
labs(title = "666posting Political Compass", subtitle = paste0("N = ", nrow(political_dat))) +
ylab("Libertarian to Authoritarian") +
xlab("Left to Right") +
# scale_x_continuous(breaks = c(0:10))+
#scale_y_continuous(breaks = c(0:10)) +
scale_x_continuous(breaks = c(0,2,4,6,8,10))+
scale_y_continuous(breaks = c(0,2,4,6,8,10)) +
scale_size_continuous(breaks = c(1, 3, 6, 9, 13, 15, 17), range = c(2.5,8)) +
annotate("text", x = c(8, 8),
y = c(2, 3.5),
label = c("Mean", "Median"), size =4) +
annotate("text", x = c(2.5, 2.5, 7.5, 7.5),
y = c(10.5, -.5, 10.5, -.5),
label = c("Auth Left", "Lib Left", "Auth Right", "Lib Right"), size =5) +
coord_fixed() + theme_void(base_size = 14)
ggsave("outputs/political_compass_exact.png")
Second political compass scatterplot, this time with a jitter and the shape of the points mapped to gender. In case you’re wondering why some points fall outside the figure, that is due to the jitter; I apologize for that.
ggplot(political_dat) +
geom_rect(aes(xmin = 5, xmax= 10,ymin = 0, ymax=5), fill = "yellow", alpha = 0.005) +
geom_rect(aes(xmin = 0, xmax= 5,ymin = 0, ymax=5), fill = "green", alpha = 0.005) +
geom_rect(aes(xmin = 5, xmax= 10,ymin = 5, ymax=10), fill = "blue", alpha = 0.005) +
geom_rect(aes(xmin = 0, xmax= 5,ymin = 5, ymax=10), fill = "red", alpha = 0.005) +
geom_jitter(aes(x = left_right, y = lib_auth, shape = gender), width =.5, height =.5, alpha = .5, ) +
#Adding median and mean
geom_point(aes(x=med_LR, y=med_LA), colour="blue", shape = 9, size =2) +
geom_point(aes(x=mean_LR, y=mean_LA), colour="red", shape = 9, size =2) +
#Add the dotplot annotation
geom_point(aes(x=8, y=3), colour="blue", shape = 9, size =3) +
geom_point(aes(x=8, y=1.5), colour="red", shape = 9, size =3) +
#Add annotations to the compass
labs(title = "666posting Political Compass", subtitle = paste0("N = ", nrow(political_dat))) +
ylab("libertarian to authoritarian") +
xlab("left to right") +
coord_fixed() + annotate("text", x = c(2.5, 2.5, 7.5, 7.5),
y = c(10.5, -.5, 10.5, -.5),
label = c("Auth Left", "Lib Left", "Auth Right", "Lib Right"), size =5) +
annotate("text", x = c(8, 8),
y = c(2, 3.5),
label = c("Mean", "Median"), size =4) +
theme_void(base_size = 14)
ggsave("outputs/political_compass_gender.png")
The same scatterplot as above, but this time the dots are colored by the age of the individuals. I had to remove the one individual over 40 because they threw the scale off even when I used a log scale, so my apologies to them.
#Remove the 40+ outlier because it skews the color scale
political_dat_age <- political_dat %>% filter(age < 40)
#Age
ggplot(political_dat_age) +
geom_rect(aes(xmin = 5, xmax= 10,ymin = 0, ymax=5), fill = "yellow", alpha = 0.005) +
geom_rect(aes(xmin = 0, xmax= 5,ymin = 0, ymax=5), fill = "green", alpha = 0.005) +
geom_rect(aes(xmin = 5, xmax= 10,ymin = 5, ymax=10), fill = "blue", alpha = 0.005) +
geom_rect(aes(xmin = 0, xmax= 5,ymin = 5, ymax=10), fill = "red", alpha = 0.005) +
geom_jitter(aes(x = left_right, y = lib_auth, color = age), width =.5, height =.5 ) +
#Adding median and mean
geom_point(aes(x=med_LR, y=med_LA), colour="blue", shape = 9, size =2) +
geom_point(aes(x=mean_LR, y=mean_LA), colour="red", shape = 9, size =2) +
labs(title = "666posting Political Compass", subtitle = paste0("N = ", nrow(political_dat))) +
ylab("libertarian to authoritarian") +
xlab("left to right") +
coord_fixed() + annotate("text", x = c(2.5, 2.5, 7.5, 7.5),
y = c(10.5, -.5, 10.5, -.5),
label = c("Auth Left", "Lib Left", "Auth Right", "Lib Right"), size =5) +
scale_colour_gradient(low = "black", high = "white", na.value = NA) +
theme_void()
Let us see if we can break this data up into the different quadrants, including wiggle room in the middle for centrists. Again, it’s a scale of 0-10, so I’m gonna have 4-6 as centrist.
political_dat <- political_dat %>% mutate(quadrant = case_when(
left_right <= 3 & lib_auth <= 3 ~ "Libertarian Left",
left_right <= 3 & lib_auth >= 7 ~ "Authoritarian Left",
left_right >= 7 & lib_auth <= 3 ~ "Libertarian Right",
left_right >= 7 & lib_auth >= 7 ~ "Authoritarian Right",
left_right >= 4 & left_right <= 6 & lib_auth >= 4 & lib_auth <= 6 ~ "Centrist",
left_right >= 4 & left_right <= 6 & lib_auth <= 3 ~ "Libertarian Center",
left_right >= 4 & left_right <= 6 & lib_auth >= 7 ~ "Authoritarian Center",
lib_auth >= 4 & lib_auth <= 6 & left_right >= 7 ~ "Center Right",
lib_auth >= 4 & lib_auth <= 6 & left_right <= 3 ~ "Center Left"
), quadrant = factor(quadrant, levels = c("Libertarian Left",
"Authoritarian Left",
"Center Left",
"Libertarian Center",
"Centrist",
"Authoritarian Center",
"Libertarian Right",
"Center Right",
"Authoritarian Right"
)))
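The nine-way case_when() above can also be written by binning each axis with base R’s cut() and pasting the labels together, which guarantees every 0-10 integer lands in exactly one bin. A sketch under the same 0-3 / 4-6 / 7-10 split (the bin_axis helper is a made-up name of mine):

```r
#Bin a 0-10 integer axis into three bands; with these breaks,
#0-3, 4-6, and 7-10 each fall into their own interval
bin_axis <- function(x, labels) {
  cut(x, breaks = c(-1, 3.5, 6.5, 10), labels = labels)
}

lr <- bin_axis(c(0, 5, 10), c("Left", "Center", "Right"))
la <- bin_axis(c(2, 5, 9), c("Libertarian", "Center", "Authoritarian"))

#Paste the two axis labels into a quadrant name
quadrant <- paste(as.character(la), as.character(lr))
```

“Center Center” would still need one final relabel to “Centrist”, but the binning itself can’t silently miss a case the way a hand-written case_when() can.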
ggplot(political_dat, aes(quadrant, age, fill = quadrant)) +
geom_violin() +
geom_sina(alpha = .5) +
scale_fill_brewer(palette = "Set1") +
stat_n_text(size = 3) +
theme_classic() +
theme(legend.position = "none") +
labs(
title = "Quadrants compared to Age",
subtitle = paste0("N = ", nrow(political_dat))
) +
xlab("Quadrant") +
ylab("Age") + coord_flip()
While this plot is interesting, I don’t think there is a good way to test these groups because some of them have very few individuals. However, there is another way we can look at this.
ggplot(political_dat, aes(x = factor(left_right, 0:10), y = age, fill = factor(left_right, 0:10) )) +
geom_violin(width = 2 ) +
geom_sina(alpha = .5) +
scale_fill_brewer(palette = "Set1") +
stat_n_text(size = 3) +
theme_classic() +
theme(legend.position = "none") +
labs(
title = "Left to Right compared to Age",
subtitle = paste0("N = ", nrow(political_dat))
) +
xlab("Left to Right (0-10)") +
ylab("Age") + coord_flip()
broom::tidy(lm(age ~ left_right, data = political_dat))
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 23.2 0.414 56.1 6.70e-124
## 2 left_right 0.0375 0.139 0.270 7.88e- 1
Well that’s not significant, how about being libertarian or authoritarian?
ggplot(political_dat, aes(x = factor(lib_auth, 0:10), y = age, fill = factor(lib_auth, 0:10) )) +
geom_violin(width = 1) +
geom_sina(alpha = .5) +
scale_fill_brewer(palette = "Set1") +
stat_n_text(size = 3) +
theme_classic() +
theme(legend.position = "none") +
labs(
title = "Lib to Auth compared to Age",
subtitle = paste0("N = ", nrow(political_dat))
) +
xlab("Lib to Auth (0-10)") +
ylab("Age") + coord_flip()
tidy(lm(age ~ lib_auth, data = political_dat))
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 23.5 0.467 50.3 4.33e-115
## 2 lib_auth -0.0644 0.124 -0.519 6.04e- 1
Well, I guess I can say that, specifically in this 666posting cohort, your age doesn’t seem to influence your political leanings. Sorry that wasn’t more interesting, but again, some groupings have very few samples, which didn’t help, and I didn’t remove the outliers.
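One caveat on the lm() approach: since the 0-10 answers are ordinal, a rank-based test like Spearman’s correlation arguably fits this data better than a linear model. A sketch with invented numbers (in the real analysis the vectors would be political_dat$age and political_dat$left_right):

```r
#Illustrative vectors only, not survey values
age        <- c(19, 21, 23, 25, 28, 31)
left_right <- c(3, 1, 4, 2, 5, 0)

#Spearman's rho tests for any monotonic association without assuming
#the 0-10 scale has equal spacing between its points
res <- cor.test(age, left_right, method = "spearman")
```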
Another question I asked is whether you play instruments or not, and if so, which ones?
Prepare the data and make a melted version of the dataframe
band_dat <- dat %>% mutate(id = row_number(),
instruments = gsub("Music Software \\(eg. Ableton, FL studio\\)",
"Music Software",
instruments),
instruments = str_replace_all(instruments,
"a little bit of chinese flute",
"Chinese Flute")) %>%
#Drop answers that indicate the person doesn't play
filter(!str_detect(instruments, "anymore"),
instruments != "Nope",
!str_detect(instruments, "not")) %>%
select(id, instruments, in_band)
band_dat_melt <- band_dat %>%
mutate(instrument = (str_split(instruments, ","))) %>%
unnest(cols = instrument) %>%
mutate(instrument =(str_trim(instrument, "both"))) %>%
filter(!(instrument %in% c("Used to","but not anymore", "", "Nope")))
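For the curious, the split/unnest/trim combo above is essentially what tidyr’s separate_rows() does in one call; the core move in base R is just strsplit() plus trimws(). A tiny sketch on a hypothetical multi-select answer:

```r
#A hypothetical survey answer with the messy spacing the form produces
answer <- "Guitar, Piano/Keyboard , Voice"

#Split on commas, then strip the stray whitespace around each token
tokens <- trimws(strsplit(answer, ",")[[1]])
```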
#Check data looks okay
unique(band_dat_melt$instrument)
## [1] "Guitar" "Banjo" "Voice" "Accordion"
## [5] "Bass" "Violin/Viola" "Percussion/Drums" "Piano/Keyboard"
## [9] "Brass" "Music Software" "Cello" "Mandolin"
## [13] "Sampler" "Woodwind" "Ukulele" "omnichord"
## [17] "Chinese Flute" "clarinet" "synthesiser"
#Make dataframe giving a 1 or 0 to each person based on if they play the instrument or not, 1 is TRUE and 0 is false
instr_dat <- (as.data.frame.matrix(table(band_dat_melt$id, band_dat_melt$instrument)))
Generate general summaries to be used in the first plot
#General stats
sum_instr <- as.data.frame(table(band_dat_melt$instrument),
stringsAsFactors = F) %>%
mutate(Var1 = str_to_title(Var1))
Plot of instruments played in 666posting
ggplot(sum_instr, aes(x = reorder(Var1, -Freq), y = Freq, fill = Var1)) +
geom_bar(stat= "identity") +
labs(title = "Instruments played in this group",
subtitle = paste0("N = ", length(unique(band_dat_melt$id)))) +
scale_y_continuous(limits = c(0, 125)) +
ylab(label = "Count") +
xlab(label = "Instruments") +
theme_classic(base_size = 13) +
theme(axis.text.x = element_text(angle=85, hjust=1, vjust = 1),
legend.position = "None") +
geom_text(aes(label=Freq),
position=position_dodge(width=0.9),
vjust=-0.25)
ggsave("outputs/instrument_summary.png", width = 8, height = 8)
Okay, here I’m about to go on a visualization tirade. I’m sure you’re used to Venn diagrams, but you’ve probably not heard of UpSet plots. Venn diagrams are the classic way to represent count crossover between multiple groups, but when there are many groups this generally reduces to circles with numbers inside of them. That doesn’t do the scale of the differences proper justice, and people, in general, have an easier time grasping magnitudes from a visualization, especially when comparing the sizes of adjacent bars. As a result, a newer kind of plot was developed: the UpSet plot.
I will explain how to read them now. An UpSet plot is made of two bar plots connected by a central matrix of dots and lines. The bar plot on the “y-axis” side of the matrix gives the total count of people who play each instrument, and each row of the matrix is labeled with an instrument. Each column of the matrix has a different combination of filled dots; the filled dots mark which instruments are being intersected, and the size of that intersection is what the “x-axis” bar plot above counts. So one plot gives you the total number of people who play a given instrument, while the other gives you how many people share the same combination of instruments. I hope this makes sense, but it will be readily apparent once you see the plots.
Concretely: the total count for each instrument is in the bottom left corner, where you can see we have over 100 guitarists who answered the survey, followed by ~60 keyboard/piano players. To the right of that graph you’ll see the dots and lines. If two dots are filled, the bar above them represents the intersection of individuals who play both of those instruments; as more dots are filled, more instruments are in the intersection. So you can see that there are 25 solo guitarists, while 2 people play all of the instruments in this graph.
As I said, I couldn’t fit all the data into the first plot because of its set limit, so sorry to all the unique instrument players, but at least you got a shout-out in the first graph, you Chinese-flute-playing god.
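For anyone curious what UpSetR’s fromList() actually hands to upset(): it converts the named list of ids into a binary membership matrix, one row per person and one 0/1 column per set, which is the same shape as the instr_dat table built earlier. A base-R sketch of that structure with made-up ids:

```r
#Hypothetical players: which ids play which instrument
plays <- list(Guitar = c(1, 2, 3), Bass = c(2, 3), Voice = c(3, 4))

#One row per unique id, one 0/1 column per instrument
ids <- sort(unique(unlist(plays)))
membership <- sapply(plays, function(s) as.integer(ids %in% s))
rownames(membership) <- ids
```

Each UpSet intersection bar is then just a count of identical rows in this matrix; id 3 here is the “plays everything” row.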
#Make Instrument strings look like titles
colnames(instr_dat) <- str_to_title(colnames(instr_dat))
alex_musicians = nrow(instr_dat %>% filter(Banjo == 1, Bass == 1, Guitar == 1, Mandolin == 1, `Music Software` == 1, `Percussion/Drums` == 1, `Piano/Keyboard` == 1, Voice == 1))
instrument_list <- list()
#Probably should change this to an lapply, but we use this loop to generate a list of ids that play a given instrument
for (instru in 1:ncol(instr_dat)) {
instrument_list[[colnames(instr_dat)[instru]]] <- (1:nrow(instr_dat))[instr_dat[,instru] == 1]
}
plot_up <- upset(fromList(instrument_list),
order.by = "freq",
nsets = 6,
#point.size = 4,
#line.size = 1.8,
text.scale=c(2.8, 1.6, 1.8, 1.6, 2, 2),
mainbar.y.label = "Instrument Intersections",
sets.x.label = "People Per Instrument"
)
plot_up
plot_ups <- upset(fromList(instrument_list),
order.by = "freq",
nsets = 30,
#point.size = 2.8,
#line.size = 1,
text.scale=c(2.8, 1.7, 1.8, 1.6, 1.5, 2),
mainbar.y.label = "Instrument Intersections",
sets.x.label = "People Per Instrument"
)
plot_ups
Fun fact: there are 2 people who play the same instruments as Alex, namely guitar, bass, drums, banjo, voice, mandolin, music software, and keyboard (to my knowledge).
#muted for final readout
# png("outputs/instrument_plot.png", width = 800, height = 800)
# plot_ups
# dev.off()
#
# png("outputs/instrument_5_plot.png", width = 800, height = 800)
# plot_up
# dev.off()
One last quick question about our instrumentalists, are you in a band?
band_yn <- band_dat_melt %>% select(id, in_band) %>% unique()
ggplot(band_yn, aes(in_band)) +
geom_bar() + theme_classic() +
ylab("Count of individuals") +
xlab("Are you in a band?") +
labs(title = "Are 666posting musicians in a band?",
subtitle = paste("N = ", nrow(band_yn)))
Frankly, I expected “No” to be the dominant answer, but I am surprised by how good a fight “Yes” put up. Maybe that’s my personal bias from knowing more musicians who aren’t in bands.
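If you wanted to put a number on how good that fight was, an exact binomial test against a 50/50 split is the quick check. A sketch with counts I made up (the real ones would come from table(band_yn$in_band)):

```r
#Hypothetical counts of No and Yes answers; these are not the survey numbers
n_no  <- 70
n_yes <- 55

#Two-sided exact test of the null that Yes and No are equally likely
res <- binom.test(c(n_no, n_yes), p = 0.5)
```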
This is basically the same analysis as above, but this time we’re looking at the rest of the world of The Arts!
art_dat <- dat %>% filter(art != "") %>%
select(Timestamp, art)
art_dat_melt <- art_dat %>%
mutate(art = (str_split(art, ","))) %>%
unnest(cols = art) %>%
mutate(art =(str_trim(art, "both"))) %>%
filter(!(art %in% c("Used to","but not anymore", "", "Nope")))
#Check data looks okay
#unique(band_dat_melt$art)
art_dat <- (as.data.frame.matrix(table(art_dat_melt$Timestamp, art_dat_melt$art)))
#General stats
sum_art <- as.data.frame(table(art_dat_melt$art),
stringsAsFactors = F) %>%
mutate(Var1 = str_to_title(Var1))
Plot of art mediums in 666posting
ggplot(sum_art, aes(x = reorder(Var1, -Freq), y = Freq, fill = Var1)) +
geom_bar(stat= "identity") +
labs(title = "Art mediums used in this group",
subtitle = paste0("N = ", nrow(art_dat))) +
scale_y_continuous(limits = c(0, 75)) +
ylab(label = "Count") +
xlab(label = "Art medium") +
theme_classic(base_size = 13) +
theme(axis.text.x = element_text(angle=85, hjust=1, vjust = 1),
legend.position = "None") +
geom_text(aes(label=Freq),
position=position_dodge(width=0.9),
vjust=-0.25)
ggsave("outputs/art_summary.png", width = 8, height = 8)
#Prepare art upset plot
#Make art medium strings look like titles
colnames(art_dat) <- str_to_title(colnames(art_dat))
art_list <- list()
#Same loop as before, but this time collecting the ids that use a given art medium
for (art_hold in 1:ncol(art_dat)) {
art_list[[colnames(art_dat)[art_hold]]] <- (1:nrow(art_dat))[art_dat[,art_hold] == 1]
}
art_upset <- upset(fromList(art_list),
order.by = "freq",
nsets = 6,
#point.size = 4,
#line.size = 1.8,
text.scale=c(2.8, 1.6, 1.8, 1.6, 2, 2),
mainbar.y.label = "Art Medium Intersections",
sets.x.label = "People Per Medium"
)
art_upset
So… word clouds. Technically speaking, word clouds are never a good way to represent data meaningfully. Even though the letters scale with frequency, one can still be tricked into thinking that, in this case, long band names have more votes, which isn’t true. As a result, you’ll see that I ended up coloring the data a little bit to help correct for this visual discrepancy.
musician_dat <- dat %>%
select(Timestamp, favorite_musician, sec_musician = second_favorite_musician, music_genre) %>%
#Mutate all names to lower first and trim whitespace
mutate(favorite_musician = str_trim(tolower(favorite_musician)),
sec_musician = str_trim(tolower(sec_musician)),
#Fix all spelling of musician names that stand out to me
favorite_musician = case_when(
favorite_musician == "elliot smith" ~ "elliott smith",
str_detect(favorite_musician, "suicide") ~ "teen suicide",
str_detect(favorite_musician, "title") ~ "title fight",
T ~ favorite_musician),
sec_musician = case_when(
sec_musician == "elliot smith" ~ "elliott smith",
str_detect(sec_musician, "suicide") ~ "teen suicide",
T ~ sec_musician
),
#Clean up alex g titles to new/old official spelling and string to title for presentation
favorite_musician = ifelse(favorite_musician == "(sandy) alex g", "Alex G", str_to_title(favorite_musician)),
sec_musician = ifelse(sec_musician == "(sandy) alex g", "Alex G", str_to_title(sec_musician))) %>%
#clean all blanks
filter(favorite_musician != "", sec_musician != "", music_genre != "")
fave_musician <- as.data.frame(table(musician_dat$favorite_musician), stringsAsFactors = F)
sec_musician <- as.data.frame(table(musician_dat$sec_musician), stringsAsFactors = F)
color_artist <- brewer.pal(n = 3, name = "Dark2")
fave_musician <- fave_musician %>% mutate(color_fan = case_when(
Freq >= 29 ~ color_artist[1],
Freq > 1 ~ color_artist[2],
T ~ "black"
))
sec_musician <- sec_musician %>% mutate(color_fan = case_when(
Freq >= 29 ~ color_artist[1],
Freq > 1 ~ color_artist[2],
T ~ "black"
))
#Word cloud for markdown
wordcloud(words = fave_musician$Var1, freq = fave_musician$Freq, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.30,
colors=fave_musician$color_fan, ordered.colors = T,
scale=c(5,.666)
)
#Print word cloud
## Silented printing for final release
# png("outputs/favorite_musician.png", units="in", width=5, height=5, res=300)
# wordcloud(words = fave_musician$Var1, freq = fave_musician$Freq, min.freq = 1,
# max.words=200, random.order=FALSE, rot.per=0.35,
# colors=fave_musician$color_fan, ordered.colors = T
# )
# dev.off()
If a musician has more than one fan, their name is in orange.
top_fave <- fave_musician %>% arrange(desc(Freq)) %>% head(11)
top_fave_nog <- top_fave[-1,]
ggplot(top_fave_nog, aes(x = factor(Var1, rev(top_fave_nog$Var1)), y = Freq)) +
geom_bar(stat = "identity") + coord_flip() +
xlab("Top Favorite Musicians") +
ylab("Count") +
labs(title ="Top Ten Favorite Non-Alex G Artists") + theme_classic()
Alex has 80 fans by the way.
#Word cloud for markdown
wordcloud(words = sec_musician$Var1, freq = sec_musician$Freq, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=sec_musician$color_fan, ordered.colors = T,
scale=c(5,0.5)
)
#Print word cloud
png("outputs/second_musician.png", units="in", width=5, height=5, res=300)
wordcloud(words = sec_musician$Var1, freq = sec_musician$Freq, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=sec_musician$color_fan, ordered.colors = T
)
dev.off()
## png
## 2
sec_fave <- sec_musician %>% arrange(desc(Freq)) %>% head(11)
sec_fave_nog <- sec_fave[-1,]
ggplot(sec_fave_nog, aes(x = factor(Var1, rev(sec_fave_nog$Var1)), y = Freq)) +
geom_bar(stat = "identity") + coord_flip() +
xlab("Top Second Favorite Musicians") +
ylab("Count") +
labs(title ="Top Ten Second Favorite Non-Alex G Artists") + theme_classic()
Alex has 33 individuals who consider him their second favorite artist.
genre_dat <- musician_dat %>%
mutate(music_genre = str_replace(str_to_lower(str_trim(music_genre)), "-", " ")) %>%
filter(music_genre != "") %>%
mutate(music_genre = case_when(
music_genre %in% "imdie rock" ~ "indie rock",
music_genre %in% "lofi / 'indie rock' i guess lol" ~ "lofi",
music_genre %in% "indite rock" ~ "indie rock",
music_genre %in% "indie rock / folk / jazz" ~ "jazz",
music_genre %in% "indie rock, punk, alternative." ~ "punk",
music_genre %in% "shoegaze/indie rock" ~ "shoegaze",
T ~ music_genre
),
music_genre = str_to_title(music_genre))
genre_table <- data.frame(table(genre_dat$music_genre), stringsAsFactors = F)
genre_table <- genre_table %>% mutate(color_fan = case_when(
Freq >= 30 ~ color_artist[1],
Freq > 1 ~ color_artist[2],
T ~ "black"
))
#Word cloud for markdown
wordcloud(words = genre_table$Var1, freq = genre_table$Freq, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.30,
colors=genre_table$color_fan, ordered.colors = T,
scale=c(4,1)
)
# #Print word cloud
# png("outputs/favorite_genre.png", units="in", width=5, height=5, res=300)
# wordcloud(words = genre_table$Var1, freq = genre_table$Freq, min.freq = 1,
# max.words=200, random.order=FALSE, rot.per=0.30,
# colors=genre_table$color_fan, ordered.colors = T
# )
# dev.off()
genre_fave <- genre_table %>% arrange(desc(Freq)) %>% head(10)
ggplot(genre_fave, aes(x = factor(Var1, rev(genre_fave$Var1)), y = Freq)) +
geom_bar(stat = "identity") + coord_flip() +
xlab("Top Favorite Genres") +
ylab("Count") +
labs(title ="Top Ten Favorite Music Genres") + theme_classic()
music_summary <- t(as.matrix(table(dat$music_method)))
music_summary <- t(as.matrix(music_summary[,-1]))
music_summ_table <- ggtexttable(music_summary)
First, we gotta do some MASSIVE data cleaning. I really should not have left this open for people to put whatever they want; look at all the unique answers, of which, as of writing, there are 37. So we gotta find a way to generalize these. That said, some are very interesting.
unique(dat$discover_method)
## [1] "Through a friend"
## [2] "Spotify/Apple Suggested Music"
## [3] "I know him personally"
## [4] "DIY show "
## [5] "Youtube"
## [6] "Press"
## [7] "Tumblr"
## [8] "College Radio"
## [9] "Reddit"
## [10] "Compilation Albums"
## [11] "I read an article from Pitchfork about the release of DSU"
## [12] "Suggested by another artist I listen to"
## [13] "Pitchfork"
## [14] "SoundCloud"
## [15] "Toured Together"
## [16] "Crush at the time"
## [17] "another FB group"
## [18] ""
## [19] "4chan"
## [20] "Soundcloud"
## [21] "saw his name on a festival lineup"
## [22] "Soundcloud recommendation"
## [23] "Flaked on Netflix"
## [24] "A girl in twitter who made playlists"
## [25] "Record store clerk"
## [26] "through music blogs like NME or pitchfork"
## [27] "Radio"
## [28] "Nintendo 64 cover on a Ztapes compilation"
## [29] "pitchfork's rocket review, sorry!"
## [30] "8tracks.com"
## [31] "Wikipedia"
## [32] "Festival Lineup"
## [33] "Music journalists/festivals"
## [34] "Saw him perform at a festival"
## [35] "Music blog"
## [36] "Facebook"
## [37] "He performed at a festival I attended"
## [38] "lofi record labels (specifically birdtapes and orchid tapes)"
## [39] "From a vine"
## [40] "Through facebook friends"
## [41] "tumblr"
## [42] "Through my brother (maybe this comes under friend?)"
## [43] "420 Love Songs Compilation (wasnt sure if you meant Alex G compilations sorry!)"
## [44] "Through the Spotify playlists of a YouTuber I like"
## [45] "i really canâ\200\231t remember "
## [46] "radio podcast"
## [47] "at a concert (Living Bread, brooklyn, may 2013)"
## [48] "was the support at a show i was at "
## [49] "Flake"
## [50] "GTA V lol"
If you’re viewing the version without the code you might not know what is going on here, but basically I am using key words in all of the non-standard options to fit them into more general categories. For example, if the answer includes a website, I detect a unique substring of that website’s name (“umblr” for Tumblr) and then rename that answer to “Other Website”. This is a more interesting section to look at via the coded version of this file.
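Concretely, the reason for chopping the first letter off each keyword (“umblr”, “ouTube”) is that the match then works whether or not the respondent capitalized the word, without lowercasing the whole column. A base-R sketch on invented answers:

```r
#Hypothetical free-text answers
answers <- c("Tumblr", "tumblr", "Through a YouTuber", "Radio")

#paste(..., collapse = "|") turns the keyword vector into one regex
#alternation, so a single grepl() checks every keyword at once
websites <- c("umblr", "ouTube")
cleaned <- ifelse(grepl(paste(websites, collapse = "|"), answers),
                  "Other Website", answers)
```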
#I'm trying to avoid making all lower and then upper
websites <- c("umblr", "ound", ".com","iki", "vine", "acebook", "FB", "blog", "chan", "You", "eddit", "etflix")
online_personality <- c("witter", "artist ZI", "ouTuber")
friends <- c("other", "Crush", "clerk")
radio <- "adio"
journalism <- c("itchfork", "ress", "review")
live_music <- c("estival", "oncert", "oured", "show")
compilations_lab <- c("ompilation", "label")
dat_disc <- mutate(dat, discover_method = case_when(
str_detect(discover_method, paste(websites, collapse = "|")) ~ "Other Website",
str_detect(discover_method, paste(online_personality, collapse = "|")) ~ "Online Personality",
str_detect(discover_method, paste(journalism, collapse = "|")) ~ "Journalism",
str_detect(discover_method, paste(friends, collapse = "|")) ~ "Through a friend",
str_detect(discover_method, paste(live_music, collapse = "|")) ~ "Live Music",
str_detect(discover_method, paste(radio, collapse = "|")) ~ "Radio",
str_detect(discover_method, paste(compilations_lab, collapse = "|")) ~ "Compilations/Label",
T ~ discover_method
)) %>% filter(!discover_method %in% c("", "i really can’t remember ")) %>%
select(Timestamp, discover_method)
unique(dat_disc$discover_method)
## [1] "Through a friend" "Spotify/Apple Suggested Music"
## [3] "I know him personally" "Live Music"
## [5] "Other Website" "Journalism"
## [7] "Radio" "Compilations/Label"
## [9] "Online Personality" "Flake"
## [11] "GTA V lol"
Now I make the simplified plot, including a table of how we listen to our music.
discover_g <- ggplot(dat_disc, aes(x = discover_method)) +
geom_bar() +
coord_flip() +
xlab("How did you discover Alex G") +
geom_text(stat='count', aes(label=..count..), nudge_y = 1.8) +
labs(title = "How did we discover Alex G?",
subtitle = paste("N =", nrow(dat_disc))) +
theme_classic()
grob_mus <- ggarrange(discover_g, music_summ_table, nrow = 2, heights = c(1, .3))
annotate_figure(grob_mus, bottom = "How we listen to music")
Now we’re getting into that good Alex G data, favorite albums, favorite songs, and more!
##Filter down to albums, songs, and music videos to make an object usable in the next couple chunks
songs_dat <- dat %>%
select(Timestamp,
favorite_song,
favorite_album,
unreleased_album,
music_vid) %>%
mutate(music_vid = str_to_title(music_vid),
music_vid = ifelse(str_detect(music_vid, "Poison"), "Poison Root", music_vid))
A barplot of favorite released Alex G albums
album_plot <- ggplot(songs_dat, aes(x = forcats::fct_infreq(favorite_album))) +
geom_bar(aes(fill = forcats::fct_infreq(favorite_album))) +
scale_fill_manual(
values = c("deepskyblue1",
"royalblue4",
"springgreen4",
"deeppink1",
"red3",
"darkorchid4",
"black",
"grey68",
"yellow3"
)
) +
theme_classic(base_size = 14) +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
legend.title = element_blank()) +
labs(title = "Favorite Alex G Album",
subtitle = paste0("N = ",nrow(songs_dat))) +
geom_text(stat='count', aes(label=..count..), nudge_y = 2.5)
album_plot
ggsave("outputs/favorite_album.png", height = 7, width = 7)
A barplot of favorite Alex G Fan compilations, also for those of you that don’t listen to his unreleased stuff, you should give it a listen.
fan_comp_plot <- ggplot(songs_dat, aes(x = forcats::fct_infreq(unreleased_album))) +
geom_bar(aes(fill = forcats::fct_infreq(unreleased_album))) +
theme_classic(base_size = 14) +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
legend.title = element_blank()) +
labs(title = "Favorite Alex G Fan Compilation",
subtitle = paste0("N = ",nrow(songs_dat))) +
geom_text(stat='count', aes(label=..count..), nudge_y = 2.5)
fan_comp_plot
#Save it
ggsave("outputs/favorite_fancomp.png", height = 7, width = 7)
Overall it seems that people don’t listen to them, but as I said, I suggest it. The winner, by a large margin, is Monsterhead, which makes sense to me. Out of all his compilations, I feel it is the most stacked with well-known unreleased tracks (Nintendo 64, Uh, Written in Blood, etc.)
music_vid_plot <- ggplot(songs_dat, aes(x = forcats::fct_infreq(music_vid))) +
geom_bar(aes(fill = forcats::fct_infreq(music_vid))) +
theme_classic(base_size = 14) +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
legend.title = element_blank()) +
labs(title = "Favorite Alex G Music Video",
subtitle = paste0("N = ",nrow(songs_dat))) +
geom_text(stat='count', aes(label=..count..), nudge_y = 2.5)
music_vid_plot
#Save it
ggsave("outputs/favorite_music_vid.png", height = 7, width = 7)
I mean, Gretel’s video is really good.
## Prepare song data for plotting
#Read in songs for albums, A CSV I made.
album_songs <- read.csv("inputs/alex_g_album_songs.csv", stringsAsFactors = F)
album_songs <- album_songs %>% mutate_all(str_to_title)
#Fixing various naming issues
songs_dat <- songs_dat %>% mutate(favorite_song = case_when(
str_detect(favorite_song, "Bonus") ~ "Good",
str_detect(favorite_song, "House") ~ "Sugarhouse",
TRUE ~ favorite_song
),
favorite_song = str_to_title(str_trim(favorite_song, side = "both")))
#Get count of songs
songs_only <- as.data.frame(table(songs_dat$favorite_song), stringsAsFactors = F)
#Color songs based on album source
songs_only <- songs_only %>% mutate(color = case_when(
Var1 %in% album_songs$house_of_sugar ~ "darkorchid4",
Var1 %in% album_songs$rocket_ ~ "red3",
Var1 %in% album_songs$beach_music ~ "royalblue4",
Var1 %in% album_songs$dsu ~ "springgreen4",
Var1 %in% album_songs$trick ~ "deepskyblue1",
Var1 %in% album_songs$rules ~ "black",
Var1 %in% album_songs$easy ~ "lightseagreen",
Var1 %in% album_songs$race ~ "deeppink1",
Var1 %in% album_songs$winner ~ "yellow4",
TRUE ~ "gray34"
))
#Clean up data
songs_clean <- songs_only %>%
arrange(desc(Freq)) %>%
filter(Var1 != "")
A word cloud of everyone’s favorite Alex G songs from the survey, colored by the album they’re on, with size based on the number of people who selected a given song as their favorite. As you can see, the top two favorite songs of those who answered the survey are Snot and Gnaw, and tbh, I’m SNOT surprised. Ugh. Anyways, the thing I actually love about this image is that, because of how Alex names his songs, cool phrases get generated. Personally I love “Screwy People Change My Mind”; it’s just fun sticking his song titles together.
#Word cloud for markdown
wordcloud(words = songs_only$Var1, freq = songs_only$Freq, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=songs_only$color, ordered.colors = T)
# #Print word cloud
# png("outputs/favorite_song_cloud.png", units="in", width=5, height=5, res=300)
# wordcloud(words = songs_only$Var1, freq = songs_only$Freq, min.freq = 1,
# max.words=200, random.order=FALSE, rot.per=0.35,
# colors=songs_only$color, ordered.colors = T)
# dev.off()
A bar plot of the top 10 favorite Alex G songs, just to ascribe hard numbers to the data, which is the issue with word clouds as said before.
top_ten_songs <- head(songs_clean, 10)
ggplot(top_ten_songs, aes(x = reorder(Var1, -Freq), Freq)) +
geom_col(fill = top_ten_songs$color) +
labs(
title = "Top 10 Favorite Alex G Songs",
subtitle = paste("Total N =", sum(songs_clean$Freq), "; In graph N =", sum(top_ten_songs$Freq))
) + ylab("count") +
theme_classic(base_size = 14) +
theme(axis.title.x=element_blank(),
axis.text.x = element_text(angle = 90, hjust = 0))
ggsave("outputs/top_10_songs.png", height = 7, width = 7)
Looks like Gnaw is our winner!
A quick question I wanted to ask is: “Is it more likely that your favorite song is on your favorite album?” That’s what I test below. I made sure to only include favorite songs that appear on specific albums, as things get dicey with unreleased material. I ran a chi-square test with the null hypothesis that there is no association between one’s favorite song and their favorite album, meaning the count should be split evenly between “Yes, my favorite song is on my favorite album” and “No, my favorite song is not on my favorite album,” and it turns out….
### Make chi-square analysis
songs_dat_test <- songs_dat %>% mutate(favorite_album_csv = case_when(
favorite_album %in% "Rocket" ~ "rocket_",
TRUE ~ str_replace_all(str_to_lower(favorite_album), " ", "_")),
#I don't know if there is a better way to do below, because I can't use favorite_album_csv to subset
fav_song_fav_album = case_when(
favorite_song %in% album_songs$house_of_sugar & favorite_album_csv == "house_of_sugar" ~ T,
favorite_song %in% album_songs$rocket_ & favorite_album_csv == "rocket_" ~ T,
favorite_song %in% album_songs$beach_music & favorite_album_csv == "beach_music" ~ T,
favorite_song %in% album_songs$dsu & favorite_album_csv == "dsu" ~ T,
favorite_song %in% album_songs$trick & favorite_album_csv == "trick" ~ T,
favorite_song %in% album_songs$rules & favorite_album_csv == "rules" ~ T,
favorite_song %in% album_songs$easy & favorite_album_csv == "easy" ~ T,
favorite_song %in% album_songs$winner & favorite_album_csv == "winner" ~ T,
favorite_song %in% album_songs$race & favorite_album_csv == "race" ~ T,
T ~ F
)
) %>% filter(favorite_song %in% as.vector(unlist(album_songs)), favorite_song != "")
table(songs_dat_test$fav_song_fav_album)
##
## FALSE TRUE
## 112 64
favorite_song_chi <- chisq.test(table(songs_dat_test$fav_song_fav_album))
favorite_song_chi
##
## Chi-squared test for given probabilities
##
## data: table(songs_dat_test$fav_song_fav_album)
## X-squared = 13.091, df = 1, p-value = 0.0002967
Statistically speaking, it is more likely that your favorite song will not be on your favorite album. Frankly, if you compare favorite songs to favorite albums, I’m just gonna blame Gnaw for this one.
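On the “I don’t know if there is a better way” comment in the chunk above: one alternative is to index the album_songs list by each row’s album name, which avoids writing one case_when branch per album. A minimal sketch with hypothetical toy data (the real album_songs is the named list of per-album song vectors used above):

```r
# Toy stand-in for album_songs: a named list of song vectors per album
album_songs_toy <- list(rocket_ = c("Bobby", "Proud"),
                        dsu     = c("Harvey", "Hollow"))
favorite_song  <- c("Bobby", "Harvey", "Proud")
favorite_album <- c("rocket_", "rocket_", "dsu")

# Look up each favorite song in that row's favorite album
fav_song_fav_album <- mapply(function(song, album) song %in% album_songs_toy[[album]],
                             favorite_song, favorite_album)
unname(fav_song_fav_album)
#> [1]  TRUE FALSE FALSE
```

The same mapply() call could drop into the mutate() above as fav_song_fav_album = mapply(...), replacing the nine explicit branches.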
I would first like to thank a friend of mine, Stephanie Y., who looked over this markdown for me and suggested this package.
You might be wondering, “what is spotifyr?” To put it simply, it is an R package that lets me use Spotify’s API to get data about Alex G and his music. While I could do a lot with this data, what I’m most interested in is the different “qualities” that Spotify attributes to songs. These are machine-learning estimates of each song’s “valence” (sad to happy; 0 to 1), danceability (0 to 1), energy (0 to 1), etc. I’m going to make tables showing the overall top and bottom songs for each of these categories, and honestly, I’m dubious as hell about these. For example, there is a quality known as “liveness” which estimates whether the song was recorded live. The number 1 “liveness” song is Brick, while Sugar House - LIVE is only third! We also have loudness in decibels, which ranges from Clouds (thanks Luke) to Brick. But if you want to know more click here, it’s interesting anyways.
note: All scales are from 0-1 except loudness which is decibels.
note 2: PUT RACE ON SPOTIFY PLEASE
spot_cred <- read.csv("inputs/spotifyr_dat.csv", fileEncoding = "UTF-8-BOM")
Sys.setenv(SPOTIFY_CLIENT_ID = spot_cred$client_id)
Sys.setenv(SPOTIFY_CLIENT_SECRET = spot_cred$shhh)
access_token <- get_spotify_access_token()
Generating tables with the top 5 and bottom 5 tracks for a select handful of Spotify qualities.
alex_g <- get_artist_audio_features('alex g')
#Remove duplicate tracks
alex_g <- alex_g %>% select(-album_id, -track_id, -analysis_url, - available_markets,
-track_href, -track_preview_url, -track_uri, -external_urls.spotify,
-album_images) %>% distinct() %>% filter(!str_detect(track_name, "onus"))
top_bot_n <- function(df, col, n = 10){
col <- enquo(col)
top <- df %>% arrange(desc(!!col)) %>% slice_head(n = n)
bot <- df %>% arrange(desc(!!col)) %>% slice_tail(n = n)
return(rbind(top,bot))
}
energy_tab <- top_bot_n(alex_g, energy, n = 5) %>% select(`Energy Song Name` = track_name,Energy = energy)
danceability_tab <- top_bot_n(alex_g, danceability, n = 5) %>% select(`Danceability Song Name` = track_name, Danceability = danceability)
valence_tab <- top_bot_n(alex_g, valence, n = 5) %>% select(`Valence Song Name` = track_name, Valence = valence)
acousticness_tab <- top_bot_n(alex_g, acousticness, n = 5) %>% select(`Acousticness Song Name` = track_name, Acousticness = acousticness)
liveness_tab <- top_bot_n(alex_g, liveness, n = 5) %>% select(`Liveness Song Name` = track_name, Liveness = liveness)
loudness_tab <- top_bot_n(alex_g, loudness, n = 5) %>% select(`Loudness Song Name` = track_name, Loudness = loudness)
ggtexttable(cbind(energy_tab, valence_tab))
ggtexttable(cbind(danceability_tab, loudness_tab))
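The six nearly identical *_tab assignments above could also be generated in one pass. A sketch, assuming a string-based variant of top_bot_n() is acceptable (top_bot_n_chr is a hypothetical helper, and the toy tibble stands in for alex_g):

```r
library(dplyr)
library(purrr)

# String-column variant of top_bot_n(): top n and bottom n rows by one column
top_bot_n_chr <- function(df, col, n = 5) {
  sorted <- arrange(df, desc(.data[[col]]))
  bind_rows(slice_head(sorted, n = n), slice_tail(sorted, n = n))
}

toy <- tibble(track_name = paste0("song_", 1:12),
              energy  = seq(0, 1, length.out = 12),
              valence = rev(seq(0, 1, length.out = 12)))

# One table per quality, named by quality
tabs <- map(set_names(c("energy", "valence")),
            ~ select(top_bot_n_chr(toy, .x), track_name, all_of(.x)))
nrow(tabs$energy)  # 10 rows: top 5 plus bottom 5
```

From there the column renaming and ggtexttable() calls could be mapped over the list as well.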
Now that you’ve probably drawn your own opinions about the validity of this data (reminder: these are the top 5 and bottom 5 songs per song quality) let’s look over Alex’s albums and then we’ll see how this intersects with our music.
Technically not the best way to represent the data but…..
## Add plotting colors
alb_df <- data.frame(factor(c("House of Sugar", "Rocket", "Beach Music", "Dsu", "Trick", "Rules")),
c("darkorchid4", "red3", "royalblue4", "springgreen4", "deepskyblue1", "black"))
colnames(alb_df) <- c("album_names", "album_color")
alex_g$album_name <- factor(alex_g$album_name, alb_df$album_names)
ggplot(alex_g, aes(x = valence, y= ..scaled.., fill = album_name)) +
geom_density() +
facet_grid(album_name~.) +
scale_fill_manual(values = alb_df$album_color) +
theme_classic() +
theme(legend.position = "none") +
ylab("Scaled Density (Valence aka Happiness)") +
labs(title = "Valence as an emotional rollercoaster for each album")
ggplot(alex_g, aes(x = track_number, y= valence, fill = album_name)) +
geom_bar(stat = "identity") +
facet_grid(album_name~.) +
scale_fill_manual(values = alb_df$album_color) +
scale_x_continuous(breaks = 1:16)+
theme_classic() +
ylab("Valence (0-1)") +
xlab("Track Number") +
theme(legend.position = "none") +
labs(title = "Valence barplot for each Alex G album")
note: Remember, the value is in decibels; the closer to 0, the louder it is
ggplot(alex_g, aes(x = track_number, y= loudness, fill = album_name)) +
geom_bar(stat = "identity") +
facet_grid(album_name~.) +
scale_fill_manual(values = alb_df$album_color) +
scale_x_continuous(breaks = 1:16) +
theme_classic() +
ylab("Loudness (Decibels)") +
xlab("Track Number") +
theme(legend.position = "none")+
labs(title = "Loudness barplot for each Alex G album")
I find the overall increase in volume over his career interesting for two reasons. First, Alex mixed and mastered his own songs until Beach Music, which was engineered by someone at Domino Records, and the Domino albums seem to have more consistent volume levels. Second, it is well known that music has gotten louder over time, and that seems to be true for Alex as well. Actually, let’s test this real quick.
alex_g_avg <- alex_g %>% group_by(album_name) %>% summarise(album_release_year = mean(album_release_year), avg_loudness = mean(loudness))
summary(lm(loudness ~ album_release_year, data = alex_g))
##
## Call:
## lm(formula = loudness ~ album_release_year, data = alex_g)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.2013 -1.5312 0.3082 1.9228 6.1427
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1058.7570 259.5210 -4.080 0.000106 ***
## album_release_year 0.5212 0.1288 4.046 0.000119 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.974 on 80 degrees of freedom
## Multiple R-squared: 0.1699, Adjusted R-squared: 0.1595
## F-statistic: 16.37 on 1 and 80 DF, p-value: 0.0001194
ggplot(alex_g, aes(x = album_release_year, y = loudness, fill = album_name)) +
geom_boxplot() +
scale_fill_manual(values = alb_df$album_color) +
theme_classic() +
labs(title = "Has Alex gotten louder over time?",
fill = "Albums")
So, to an extent, yes, the theory is correct: his music seems to get louder as time passes, and at least Rules and Trick had a relatively wider spread of volume levels compared to his later work.
ggplot(alex_g, aes(x = track_number, y= energy, fill = album_name)) +
geom_bar(stat = "identity") +
facet_grid(album_name~.) +
scale_fill_manual(values = alb_df$album_color) +
scale_x_continuous(breaks = 1:16) +
theme_classic() +
ylab("Energy (0-1)") +
xlab("Track Number") +
theme(legend.position = "none") +
labs(title = "Energy barplot for each Alex G album")
ggplot(alex_g, aes(x = track_number, y= danceability, fill = album_name)) +
geom_bar(stat = "identity") +
facet_grid(album_name~.) +
scale_fill_manual(values = alb_df$album_color) +
scale_x_continuous(breaks = 1:16) +
theme_classic() +
ylab("Danceability (0-1)") +
xlab("Track Number") +
theme(legend.position = "none") +
labs(title = "Danceability barplot for each Alex G album")
Is there a secret formula to Alex G’s top hits? Maybe we can use this data to figure it out! By intersecting the different Spotify qualities and 666posting members favorite songs we can see if the distribution of favorites is different than Alex’s discography. Remember though, this will be excluding non-officially released tracks.
songs_only <- songs_dat %>% select(Timestamp, track_name = favorite_song) %>% mutate(track_name = case_when(
track_name == "Sugar House" ~ "SugarHouse - Live",
T ~ track_name
))
filtered_songs <- left_join(songs_only,alex_g) %>% drop_na() %>% select(track_name, danceability, energy, valence) %>% mutate(source = "Fans")
filtered_alex_g <- alex_g %>% select(track_name, danceability, energy, valence) %>% mutate(source = "Discography")
compare_fans <- rbind(filtered_songs, filtered_alex_g)
melt_fans <- melt(compare_fans)
ggplot(melt_fans, aes(x = value, ..scaled.., fill = source)) + geom_density(alpha = .5) +
facet_grid(variable~.) +
theme_classic() +
labs(title = "Comparing Spotify song qualities of Alex's discography to fan favorite songs",
subtitle = paste0("N = ",nrow(filtered_songs)),
fill = "Fan favorites or
total discography") +
ylab("Density") +
xlab("Value of Quality")
ks.test(filtered_songs$danceability,filtered_alex_g$danceability)
##
## Two-sample Kolmogorov-Smirnov test
##
## data: filtered_songs$danceability and filtered_alex_g$danceability
## D = 0.1733, p-value = 0.08157
## alternative hypothesis: two-sided
ks.test(filtered_songs$energy,filtered_alex_g$energy)
##
## Two-sample Kolmogorov-Smirnov test
##
## data: filtered_songs$energy and filtered_alex_g$energy
## D = 0.084243, p-value = 0.8441
## alternative hypothesis: two-sided
ks.test(filtered_songs$valence,filtered_alex_g$valence)
##
## Two-sample Kolmogorov-Smirnov test
##
## data: filtered_songs$valence and filtered_alex_g$valence
## D = 0.076701, p-value = 0.9127
## alternative hypothesis: two-sided
Well, these are interesting distributions. Alex has a somewhat normal curve for danceability, but there seems to be a bimodal distribution among fan favorite songs, which leans toward a higher-danceability preference. With energy, people prefer lower-energy songs, with fan preference taking a dip around 0.8. Lastly, with valence, which I will remind you is basically happiness, there is a low peak around 0.30 of sad peeps, with some happy nerds around 0.75 breaking through the discography distribution. Statistically speaking, using Kolmogorov-Smirnov tests, which check whether two samples come from the same distribution, we see that there is no significant difference, but that bimodal danceability distribution trends towards significance.
##Pull personality data
pers_dat <- dat %>% select(Timestamp, mbti, hp_house, sign, weed_use, alc_use)
#Figure size
hght <- 7.29
wdth <- 4.5
Myers-Briggs is the zeitgeist when it comes to personality tests. For more information on them check out this website: https://www.16personalities.com/free-personality-test . While there are sixteen personalities, they are not evenly distributed: according to data from the official MBTI organization, some types make up over 10% of the population while others are closer to 2%. These types can also be broken down further into their discrete single-letter traits, Extrovert vs Introvert (E v I) for example. I have two figures below showing both of these distributions in the general population. We will then compare this to the Alex G data.
Not really sure what the best question to ask here is besides what the distribution looks like and how it differs from the general population. Data pulled from here
#Reduced the percentages of ISFJ, ESFJ, and ISTJ because everywhere I look the totals sum to over 100%
mb_types = c( .137,
.122,
.115,
.088,
.087,
.085,
.081,
.054,
.044,
.043,
.033,
.032,
.025,
.021,
.018,
.015)
mb_names = c("ISFJ",
"ESFJ",
"ISTJ",
"ISFP",
"ESTJ",
"ESFP",
"ENFP",
"ISTP",
"INFP",
"ESTP",
"INTP",
"ENTP",
"ENFJ",
"INTJ",
"ENTJ",
"INFJ")
names(mb_types) = mb_names
mb_df <- as.data.frame(mb_types) %>%
rownames_to_column(var = "mbtis") %>%
rename(mbti_percent = mb_types) %>%
mutate(IE = ifelse(str_detect(mbtis, "I"), "I", "E"),
SN = ifelse(str_detect(mbtis, "S"), "S", "N"),
FT = ifelse(str_detect(mbtis, "F"), "F", "T"),
JP = ifelse(str_detect(mbtis, "J"), "J", "P"))
#Better way to do this?
EI <- mb_df %>% group_by(IE) %>% summarise(sum(mbti_percent))
SN <- mb_df %>% group_by(SN) %>% summarise(sum(mbti_percent))
FT <- mb_df %>% group_by(FT) %>% summarise(sum(mbti_percent))
JP <- mb_df %>% group_by(JP) %>% summarise(sum(mbti_percent))
colnames(EI) <- c("type", "percent type")
colnames(SN) <- c("type", "percent type")
colnames(FT) <- c("type", "percent type")
colnames(JP) <- c("type", "percent type")
letter_df <- rbind(EI,SN,FT,JP)
category <- factor(c("E or I", "E or I", "N or S", "N or S", "F or T", "F or T", "J or P", "J or P"),
c("E or I", "N or S", "F or T", "J or P"))
letter_df <- cbind(letter_df, category)
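On the “#Better way to do this?” comment: the four group_by/summarise blocks and the manual rbind can be collapsed with tidyr::pivot_longer. A sketch on a hypothetical two-row stand-in for mb_df:

```r
library(dplyr)
library(tidyr)

mb_toy <- tibble(mbti_percent = c(0.6, 0.4),
                 IE = c("I", "E"), SN = c("S", "N"),
                 FT = c("F", "T"), JP = c("J", "P"))

# Stack the four letter columns, then sum percents per (axis, letter) pair
letter_alt <- mb_toy %>%
  pivot_longer(c(IE, SN, FT, JP), names_to = "category", values_to = "type") %>%
  group_by(category, type) %>%
  summarise(`percent type` = sum(mbti_percent), .groups = "drop")
nrow(letter_alt)  # 8: two letters for each of the four axes
```

The same reshape works for the survey counts further down, which repeats this block.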
General population plots of Myers-Briggs types
#Factor levels to maintain look in plot
mb_df$mbtis <- factor(mb_df$mbtis, levels = mb_names)
gen_mbti <- ggplot(mb_df,
aes(x = mbtis,
y = mbti_percent,
fill = mbtis)) +
geom_col(color = "black") +
scale_y_continuous(labels = scales::percent) +
ylab("Myer-Briggs Type Percent") +
xlab("Myer-Briggs Type")
gen_mbti +
labs(title = "Population Myer-Briggs Types",
subtitle = "N Unknown, from studies between 1972-2002 (pls update website)") +
theme_classic()
gen_mbti_2 <- gen_mbti +
labs(title = "Population Myer-Briggs Types",
subtitle = "N Unknown") +
theme_classic() + coord_flip()
ggsave(filename = "outputs/myer-brigg-gen-pop.png", width = hght, height = wdth)
letter_df$type <- factor(letter_df$type, c("E", "I", "N", "S", "F", "T", "J", "P"))
gen_lett <- ggplot(letter_df, aes(x = category, y = `percent type`, fill = type)) +
geom_bar(position = "stack", stat = "identity") +
scale_fill_brewer(palette = "Set1", name = "Discrete Types", labels = c("Extrovert", "Introvert",
"Intuitives", "Sensors",
"Feeling", "Thinking",
"Judging", "Perceiving"
))
gen_lett +
labs(title = "Population Myer-Briggs Single Letter Types",
subtitle = "N Unknown, from studies between 1972-2002") +
theme_classic() + scale_y_continuous(labels = scales::percent)
gen_lett_2 <- gen_lett +
labs(title = "Population Letter Types",
subtitle = "N Unknown, studies between 1972-2002") +
theme_classic() + scale_y_continuous(labels = scales::percent)
ggsave(filename = "outputs/myer-brigg-lett-gen-pop.png", width = hght, height = wdth)
#### Prepare data
##Alex Data
mbti_dat <- as.data.frame(table(pers_dat$mbti), stringsAsFactors = F) %>%
filter(!(Var1 %in% c("",
"Prefer not to answer",
"I don't know"))) %>%
rename(mbtis = Var1, count = Freq) %>%
mutate(mbti_percent = count/sum(count))
#Some types got zero survey responses, so fill in the blanks with zero-count rows
missing_dat <- data.frame(as.character(mb_df$mbtis[!(mb_df$mbtis %in% mbti_dat$mbtis)]),
rep(0, length(as.character(mb_df$mbtis[!(mb_df$mbtis %in% mbti_dat$mbtis)]))),
rep(0, length(as.character(mb_df$mbtis[!(mb_df$mbtis %in% mbti_dat$mbtis)]))))
colnames(missing_dat) <- c("mbtis", "count", "mbti_percent")
mbti_dat <- rbind(mbti_dat, missing_dat)
mbti_dat$mbtis <- factor(mbti_dat$mbtis, levels = mb_names)
#### Single Letter Myer-Briggs Data
mbti_dat <- mbti_dat %>%
mutate(IE = ifelse(str_detect(mbtis, "I"), "I", "E"),
SN = ifelse(str_detect(mbtis, "S"), "S", "N"),
FT = ifelse(str_detect(mbtis, "F"), "F", "T"),
JP = ifelse(str_detect(mbtis, "J"), "J", "P"))
#Better way to do this?
EI_g <- mbti_dat %>% group_by(IE) %>% summarise(sum(count),sum(mbti_percent))
SN_g <- mbti_dat %>% group_by(SN) %>% summarise(sum(count),sum(mbti_percent))
FT_g <- mbti_dat %>% group_by(FT) %>% summarise(sum(count),sum(mbti_percent))
JP_g <- mbti_dat %>% group_by(JP) %>% summarise(sum(count),sum(mbti_percent))
colnames(EI_g) <- c("type", "count type", "percent type")
colnames(SN_g) <- c("type", "count type", "percent type")
colnames(FT_g) <- c("type", "count type", "percent type")
colnames(JP_g) <- c("type", "count type", "percent type")
letter_df_g <- rbind(EI_g,SN_g,FT_g,JP_g)
category <- factor(c("E or I", "E or I", "N or S", "N or S", "F or T", "F or T", "J or P", "J or P"),
c("E or I", "N or S", "F or T", "J or P"))
letter_df_g <- cbind(letter_df_g, category)
Run the Chi-square analysis against general population probabilities
##Running the chi-square
#mbti_dat$mbtis == mb_df$mbtis
mbti_dat <- mbti_dat[match(mb_df$mbtis, mbti_dat$mbtis),]
#mbti_dat$mbtis == mb_df$mbtis
mbti_chisq <- chisq.test(mbti_dat$count, p = mb_df$mbti_percent, simulate.p.value = T)
#Can't lump all letters so here we go slow again
EI_chisq <- chisq.test(letter_df_g$`count type`[1:2],
p = letter_df$`percent type`[1:2],
simulate.p.value = T)
NS_chisq <- chisq.test(letter_df_g$`count type`[3:4],
p = letter_df$`percent type`[3:4],
simulate.p.value = T)
FT_chisq <- chisq.test(letter_df_g$`count type`[5:6],
p = letter_df$`percent type`[5:6],
simulate.p.value = T)
JP_chisq <- chisq.test(letter_df_g$`count type`[7:8],
p = letter_df$`percent type`[7:8],
simulate.p.value = T)
pvals_lett <- format.pval(c(EI_chisq$p.value, NS_chisq$p.value, FT_chisq$p.value, JP_chisq$p.value), 3)
##Plotting the data
surv_mbti <- ggplot(mbti_dat,
aes(x = mbtis,
y = count,
fill = mbtis)) +
geom_col(color = "black") +
xlab("Myers–Briggs Type") +
ylab("Count") +
theme_classic()
surv_mbti +
labs(title = "666posting Myer-Briggs Types",
subtitle = paste0("N = ",
sum(mbti_dat$count),
" ; Chi-Square compared to population p-value = ",
format.pval(mbti_chisq$p.value, digits = 3)))
ggsave(filename = "outputs/myer-brigg-666.png", width = hght, height = wdth)
surv_mbti_2 <- surv_mbti +
labs(title = "666posting Myer-Briggs Types",
subtitle = paste0("N = ",
sum(mbti_dat$count),
" ; Chi-Square p-value = ",
format.pval(mbti_chisq$p.value, digits = 3))) + coord_flip()
letter_df_g$type <- factor(letter_df_g$type, c("E", "I", "N", "S", "F", "T", "J", "P"))
surv_lett <- ggplot(letter_df_g, aes(x = category, y = `percent type`, fill = type)) +
geom_bar(position = "stack", stat = "identity") +
scale_fill_brewer(palette = "Set1", name = "Discrete Types",
labels = c("Extrovert", "Introvert",
"Intuitives", "Sensors",
"Feeling", "Thinking",
"Judging", "Perceiving"
)) +
theme_classic() + scale_y_continuous(labels = scales::percent)
surv_lett +
labs(title = "666posting Myer-Briggs Single Letter Types",
subtitle = paste("P-values per letter group in respective order:",
paste(pvals_lett, collapse = ' ; ')))
surv_lett_2 <- surv_lett +
labs(title = "666posting Letter Types",
subtitle = paste("P-values",
paste(pvals_lett, collapse = ' ; ')))
ggsave(filename = "outputs/myer-brigg-lett-666.png", width = hght, height = wdth)
The group, compared to the general population, well, they look NOTHING alike. While the population data is relatively outdated (cough cough, hasn’t been updated since 2002), our distribution is not close to it at all; below I will put the outputs side by side. A good portion of our members come from the rarer types, and specifically, 18% of participating members are the rarest type, INFJ, which makes up ~2% of the general population. Crazy. This, of course, carries over to the single-letter types as well; the starkest difference is that the Intuitives considerably outnumber the Sensors in this group (I don’t know what that means, but if you want to explain it in the comments feel free to), which is the opposite of the general population. It’s wacky, folks.
gpt_mbti <- ggarrange(gen_mbti_2, surv_mbti_2,
ncol = 2, common.legend = T)
annotate_figure(gpt_mbti, top = text_grob(
"Comparison of Myer-Briggs Types"
))
gpt_lett <- ggarrange(gen_lett_2, surv_lett_2,
ncol = 2, common.legend = T)
annotate_figure(gpt_lett, top = text_grob(
"Comparison of Myer-Briggs Letters"
))
Note: The reason the p-values are the same as above is that I have the chi-square set to simulate.p.value = TRUE, as the distribution of expected counts throws a warning otherwise
Gonna be honest, I am underqualified to say anything about what zodiac signs mean, so I’m going to leave these results up to your interpretation. That said, one thing I do know is that these signs break up into the four classic elements (fire, water, air, and earth), which are included in my analysis. To generate the population data I pulled US birth records from 2000-2014 (via FiveThirtyEight) as an estimate. There appears to be a relatively uniform distribution of signs, and similarly of elements, in the general population.
Pulling public data to make a fair general population example and then preparing it
us_birth <- read.csv(
"https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_2000-2014_SSA.csv")
all_births = sum(us_birth$births)
us_birth_sum <- us_birth %>%
mutate(sign = case_when(
(month == 1 & date_of_month >= 20) | (month == 2 & date_of_month <= 18) ~ "Aquarius",
(month == 2 & date_of_month >= 19) | (month == 3 & date_of_month <= 20) ~ "Pisces",
(month == 3 & date_of_month >= 21) | (month == 4 & date_of_month <= 19) ~ "Aries",
(month == 4 & date_of_month >= 20) | (month == 5 & date_of_month <= 20) ~ "Taurus",
(month == 5 & date_of_month >= 21) | (month == 6 & date_of_month <= 20) ~ "Gemini",
(month == 6 & date_of_month >= 21) | (month == 7 & date_of_month <= 22) ~ "Cancer",
(month == 7 & date_of_month >= 23) | (month == 8 & date_of_month <= 22) ~ "Leo",
(month == 8 & date_of_month >= 23) | (month == 9 & date_of_month <= 22) ~ "Virgo",
(month == 9 & date_of_month >= 23) | (month == 10 & date_of_month <= 22) ~ "Libra",
(month == 10 & date_of_month >= 23) | (month == 11 & date_of_month <= 21) ~ "Scorpio",
(month == 11 & date_of_month >= 22) | (month == 12 & date_of_month <= 21) ~ "Sagittarius",
(month == 12 & date_of_month >= 22) | (month == 1 & date_of_month <= 19) ~ "Capricorn"
)) %>% group_by(sign) %>%
summarise(total_sign = sum(births)) %>%
mutate(element = case_when(
sign %in% c("Aries", "Leo", "Sagittarius") ~ "Fire",
sign %in% c("Gemini", "Libra", "Aquarius") ~ "Air",
sign %in% c("Taurus", "Virgo", "Capricorn") ~ "Earth",
sign %in% c("Cancer", "Scorpio", "Pisces") ~ "Water"
),
sign_percent = total_sign/all_births) #use the all_births total computed above
Making both colors and general population plots for zodiac signs
#Make element colors
elem_cols = c("whitesmoke", "forestgreen", "firebrick3","deepskyblue4")
##All signs of population
gen_sign <- ggplot(us_birth_sum, aes(x = sign, y = sign_percent, fill = element)) +
geom_col(color = "black") +
scale_fill_manual(values = elem_cols) +
scale_y_continuous(labels = scales::percent) +
theme_classic() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
ylab("Percent") +
xlab("Sign")
gen_sign +
labs(title = "General Population Signs (births 2000-2014)",
subtitle = paste0("N = ",
format(sum(us_birth_sum$total_sign),
big.mark = ",",
scientific = F)))
gen_sign_2 <- gen_sign +
labs(title = "General Population Signs",
subtitle = paste0("N = ",
format(sum(us_birth_sum$total_sign),
big.mark = ",",
scientific = F)))
ggsave(filename = "outputs/horo-signs-gen-pop.png", width = hght, height = wdth)
Also zodiac elements
#Element info
element_summ <- us_birth_sum %>%
group_by(element) %>%
summarise(count = sum(total_sign)) %>%
mutate(element_percent = count/sum(count))
#Element plot
gen_element<- ggplot(element_summ, aes(x = element, y = element_percent, fill = element)) +
geom_col(color = "black") +
scale_fill_manual(values = elem_cols) +
scale_y_continuous(labels = scales::percent) +
theme_classic() +
ylab("Percent") +
xlab("Elements") +
theme(legend.position = "None")
gen_element + labs(title = "General Population Sign Elements (births 2000-2014)",
subtitle = paste0("N = ",
format(sum(element_summ$count),
big.mark = ",",
scientific = F)))
gen_element_2 <- gen_element + labs(title = "General Population Sign Elements",
subtitle = paste0("N = ",
format(sum(element_summ$count),
big.mark = ",",
scientific = F)))
ggsave(filename = "outputs/horo-elements-gen-pop.png", width = hght, height = wdth)
#####Now for the Alex G group -----
#Make Zodiac table
g_Zodiac <- as.data.frame(table(pers_dat$sign), stringsAsFactors = F) %>%
filter(!(Var1 %in% c("", "Prefer not to answer"))) %>%
rename(sign = Var1, count = Freq) %>%
mutate(sign_percent = count/sum(count))
#Add sign elements
g_Zodiac <- g_Zodiac %>%
mutate(element = case_when(
sign %in% c("Aries", "Leo", "Sagittarius") ~ "Fire",
sign %in% c("Gemini", "Libra", "Aquarius") ~ "Air",
sign %in% c("Taurus", "Virgo", "Capricorn") ~ "Earth",
sign %in% c("Cancer", "Scorpio", "Pisces") ~ "Water"
)
)
#Seperate element df
g_element <- g_Zodiac %>%
group_by(element) %>%
summarise(count = sum(count)) %>%
mutate(element_percent = count/sum(count))
#Chi-square analysis against normal population probabilities
##Chi-squares
#Make sure in correct order, should all be TRUE
#g_Zodiac$sign == us_birth_sum$sign
sign_chisq <- chisq.test(g_Zodiac$count, p = us_birth_sum$sign_percent)
#sign_chisq
#Not sig according to probabilities
#Elements
#g_element$element == element_summ$element
element_chisq <- chisq.test(g_element$count, p = element_summ$element_percent)
#element_chisq
Basic Alex G zodiac sign plot
#Alex G Zodiac plot
surv_sign <- ggplot(g_Zodiac, aes(x = sign, y = sign_percent, fill = element)) +
geom_col(color = "black") +
scale_fill_manual(values = elem_cols) +
scale_y_continuous(labels = scales::percent) +
theme_classic() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
ylab("Percent") +
xlab("Sign")
surv_sign +
labs(title = "666posting Zodiac Signs",
subtitle = paste0("N = ",
sum(g_Zodiac$count),
" ; Chi-Square compared to population p-value = ",
format.pval(sign_chisq$p.value, digits = 3)))
surv_sign_2 <- surv_sign +
labs(title = "666posting Zodiac Signs",
subtitle = paste0("N = ",
sum(g_Zodiac$count),
" ; Chi-Square p-value = ",
format.pval(sign_chisq$p.value, digits = 3)))
ggsave(filename = "outputs/horo-signs-666.png", width = hght, height = wdth)
Plotting the zodiac elements of the Alex G data
#Alex G element plot
surv_element <- ggplot(g_element, aes(x = element, y = element_percent, fill = element)) +
geom_col(color = "black") +
scale_fill_manual(values = elem_cols) +
scale_y_continuous(labels = scales::percent) +
theme_classic() +
ylab("Percent") +
xlab("Elements") +
theme(legend.position = "None")
surv_element +
labs(title = "666posting Zodiac Sign Elements",
subtitle = paste0("N = ",
sum(g_element$count),
" ; Chi-Square compared to population p-value = ",
format.pval(element_chisq$p.value, digits = 3)))
surv_element_2 <- surv_element +
labs(title = "666posting Zodiac Sign Elements",
subtitle = paste0("N = ",
sum(g_element$count),
" ; Chi-Square p-value = ",
format.pval(element_chisq$p.value, digits = 3)))
ggsave(filename = "outputs/horo-elements-666.png", width = hght, height = wdth)
In our group, however, when compared to the general population there does seem to be an imbalance. Capricorns make up around 3% of our group while Geminis make up ~12%. That said, this is not a significant change in the distribution according to a chi-square analysis. The same is true for elements: the change nears significance but does not cross the arbitrary (thanks Fisher) 0.05 threshold, though overall there are fewer Earth and Water signs in this group.
gpt_sign <- ggarrange(gen_sign_2, surv_sign_2,
ncol = 2, common.legend = T)
annotate_figure(gpt_sign, top = text_grob(
"Comparison of Astrological Signs"
))
gpt_element <- ggarrange(gen_element_2, surv_element_2,
ncol = 2, common.legend = T)
annotate_figure(gpt_element, top = text_grob(
"Comparison of Astrological Elements"
))
I’m also not great with Harry Potter info, but this was relatively interesting because there is some insight to be gained. I had to eyeball the general population percentages from the article below because they didn’t actually give the true numbers. What’s interesting is that the main difference between the general population and 666posting is that we have more Slytherins. What makes this even more interesting is that, according to the Time article, Slytherin makes up a large proportion of the younger population, and I’d say that this group leans to the younger side (see the age analysis above). And so, I’d argue this cohort doesn’t stray from the normal population distribution when adjusted for age. But we don’t have enough older people to properly adjust for age, so just take my word for it, K? Great!
Data taken from https://drive.google.com/file/d/0B8PCmhQmtcDKLXlzSGtnZ0hKbjQ/view and from Time Magazine; they don’t have the actual percents posted, but a much larger sample size.
#My best guess from eyeing The Times
house_percent <- c(.15, .30, .47, .08)
house <- c("Gryffindor", "Hufflepuff", "Ravenclaw", "Slytherin")
house_colors <- c("firebrick", "goldenrod3", "dodgerblue4", "darkgreen")
america_house <- data.frame(house, house_percent, stringsAsFactors = F)
Re-plot the Times Data
gen_house <- ggplot(america_house,
aes(x = house, y = house_percent, fill = house)) +
geom_col(color = "black") +
scale_fill_manual(values = house_colors) +
scale_y_continuous(labels = scales::percent) +
xlab("Hogwarts Houses") +
theme_classic() +
theme(legend.position = "None")
gen_house +
labs(title = "US Hogwarts House Distribution (Time Magazine's Data)",
subtitle ="N = ~500,000")
gen_house_2 <- gen_house +
labs(title = "US Hogwarts House Distribution",
subtitle ="N = ~500,000")
ggsave(filename = "outputs/hogwarts-gen-pop.png")
Now that we have the general population, prepare the 666posting data
#Should turn this into a function by this point
house_dat <- as.data.frame(table(pers_dat$hp_house), stringsAsFactors = F) %>%
filter(!(Var1 %in% c("",
"Prefer not to answer",
"I don't know"))) %>%
rename(house = Var1, count = Freq) %>%
mutate(house_percent = count/sum(count))
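Acting on the “#Should turn this into a function” comment: the repeated table()/filter/rename/percent pipelines (MBTI, signs, houses) could share one helper. A sketch; tally_percent is a hypothetical name:

```r
library(dplyr)

# Count responses, drop non-answers, and add a percent column
tally_percent <- function(vec, value_col,
                          drop = c("", "Prefer not to answer", "I don't know")) {
  as.data.frame(table(vec), stringsAsFactors = FALSE) %>%
    rename(!!value_col := vec, count = Freq) %>%
    filter(!(.data[[value_col]] %in% drop)) %>%
    mutate(percent = count / sum(count))
}

# e.g. tally_percent(pers_dat$hp_house, "house") would rebuild house_dat
tally_percent(c("Slytherin", "Slytherin", "Gryffindor", ""), "house")
```

Passing the column name as a string keeps the helper simple; a tidy-eval version with enquo() would also work, as in top_bot_n() earlier.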
Chi-Square for houses
#Sig difference
house_chisq <- chisq.test(house_dat$count, p = house_percent)
house_chisq
##
## Chi-squared test for given probabilities
##
## data: house_dat$count
## X-squared = 26.732, df = 3, p-value = 6.699e-06
Plotting Alex G Hogwarts Houses
###Plot house
surv_house <- ggplot(house_dat,
aes(x = house, y = count, fill = house)) +
geom_col(color = "black") +
scale_fill_manual(values = house_colors) +
xlab("Hogwarts Houses") +
theme_classic() +
theme(legend.position = "None")
surv_house +
labs(title = "666posting Hogwarts Houses",
subtitle = paste0("N = ",
sum(house_dat$count),
" ; Chi-Square compared to population p-value = ",
format.pval(house_chisq$p.value, digits = 3)))
surv_house_2 <- surv_house +
labs(title = "666posting Hogwarts Houses",
subtitle = paste0("N = ",
sum(house_dat$count),
" ; Chi-Square p-value = ",
format.pval(house_chisq$p.value, digits = 3)))
ggsave(filename = "outputs/hogwarts-666.png", width = hght, height = wdth)
As I said above, the only thing of note is the change in Slytherin, which I attribute to the trend that Slytherins are generally younger, and this group is a younger sample when compared to the overall population.
gpt_house <- ggarrange(gen_house_2, surv_house_2,
ncol = 2, common.legend = T)
annotate_figure(gpt_house, top = text_grob(
"Comparison of Harry Potter Houses"
))
My last question, and the one I found most interesting, is: are personality types related to one another? E.g. are certain MBTI types enriched in a given zodiac sign? Long story short, kinda, kinda not. I made a bunch of heatmap tables using an R package called ztable and calculated chi-squares, and while there were a few hits, I assure you none would stand up to any multiple-testing correction, so I don’t have any larger statement to make here. I feel like we are underpowered in participants to test this hypothesis, as not everyone answered all three survey questions, and even if they had we still might need a larger N.
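To make the multiple-testing point concrete, here is a sketch of a Benjamini-Hochberg correction using base R’s p.adjust; the p-values below are hypothetical stand-ins for the heatmap chi-squares:

```r
# Hypothetical raw chi-square p-values from several heatmap tables
pvals <- c(0.012, 0.034, 0.210, 0.480, 0.650)

# Benjamini-Hochberg false discovery rate adjustment
round(p.adjust(pvals, method = "BH"), 3)
#> [1] 0.060 0.085 0.350 0.600 0.650
```

Even the two raw “hits” below 0.05 climb above it after correction, which is the sense in which none of the heatmap hits would survive.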
note: I’m going to not echo these as the code is repetitive.
Using Ztables from this link
Ztable signs vs MBTI
hor_mbt_dat <- pers_dat %>% filter(!(mbti %in% c("", "Prefer not to answer", "I don't know")),
!(sign %in% c("", "Prefer not to answer", "I don't know"))) %>%
mutate(element = case_when(
sign %in% c("Aries", "Leo", "Sagittarius") ~ "Fire",
sign %in% c("Gemini", "Libra", "Aquarius") ~ "Air",
sign %in% c("Taurus", "Virgo", "Capricorn") ~ "Earth",
sign %in% c("Cancer", "Scorpio", "Pisces") ~ "Water"
)
)
#Sign
sign_tab <- table(hor_mbt_dat$mbti, hor_mbt_dat$sign)
sign_z <- ztable(sign_tab)
chi_sign_mbti <- chisq.test(hor_mbt_dat$mbti, hor_mbt_dat$sign, simulate.p.value = T)
#No way to format this nicely
sign_z %>% makeHeatmap() %>%
print(caption = paste0("Heatmap Table of Zodiac Signs and Myers-Briggs Types; Chi-Square p-value of ", format.pval(chi_sign_mbti$p.value, digits = 3)))
| Aquarius | Aries | Cancer | Capricorn | Gemini | Leo | Libra | Pisces | Sagittarius | Scorpio | Taurus | Virgo | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ENFJ | 2 | 2 | 1 | 0 | 0 | 0 | 2 | 0 | 2 | 1 | 0 | 1 |
| ENFJ | ENFP | ENTJ | ENTP | ESFP | INFJ | INFP | INTJ | INTP | ISFJ | ISFP | ISTP | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Air | 4 | 9 | 0 | 1 | 1 | 7 | 16 | 4 | 9 | 0 | 1 | 2 |
Ztable for Zodiac Signs and HP Houses
| Aquarius | Aries | Cancer | Capricorn | Gemini | Leo | Libra | Pisces | Sagittarius | Scorpio | Taurus | Virgo | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gryffindor | 2 | 1 | 1 | 1 | 2 | 2 | 2 | 1 | 4 | 0 | 1 | 2 |
| Air | Earth | Fire | Water | |
|---|---|---|---|---|
| Gryffindor | 6 | 4 | 7 | 2 |
| Gryffindor | Hufflepuff | Ravenclaw | Slytherin | |
|---|---|---|---|---|
| ENFJ | 3 | 1 | 3 | 1 |
Prepare data
letter_dat <- pers_dat %>% filter(!(mbti %in% c("", "Prefer not to answer", "I don't know")),
!(sign %in% c("", "Prefer not to answer", "I don't know"))) %>%
mutate(IE = ifelse(str_detect(mbti, "I"), "I", "E"),
SN = ifelse(str_detect(mbti, "S"), "S", "N"),
FT = ifelse(str_detect(mbti, "F"), "F", "T"),
JP = ifelse(str_detect(mbti, "J"), "J", "P")) %>%
mutate(element = case_when(
sign %in% c("Aries", "Leo", "Sagittarius") ~ "Fire",
sign %in% c("Gemini", "Libra", "Aquarius") ~ "Air",
sign %in% c("Taurus", "Virgo", "Capricorn") ~ "Earth",
sign %in% c("Cancer", "Scorpio", "Pisces") ~ "Water"
)
)
Sign and IE
| Aquarius | Aries | Cancer | Capricorn | Gemini | Leo | Libra | Pisces | Sagittarius | Scorpio | Taurus | Virgo | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| E | 4 | 7 | 8 | 2 | 5 | 5 | 6 | 2 | 6 | 4 | 3 | 4 |
| Aquarius | Aries | Cancer | Capricorn | Gemini | Leo | Libra | Pisces | Sagittarius | Scorpio | Taurus | Virgo | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| N | 16 | 14 | 11 | 9 | 17 | 15 | 17 | 14 | 21 | 10 | 9 | 18 |
| Aquarius | Aries | Cancer | Capricorn | Gemini | Leo | Libra | Pisces | Sagittarius | Scorpio | Taurus | Virgo | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| F | 15 | 14 | 9 | 5 | 11 | 15 | 12 | 8 | 19 | 7 | 10 | 11 |
| Aquarius | Aries | Cancer | Capricorn | Gemini | Leo | Libra | Pisces | Sagittarius | Scorpio | Taurus | Virgo | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| J | 3 | 3 | 3 | 2 | 5 | 6 | 7 | 5 | 10 | 7 | 2 | 8 |
| Air | Earth | Fire | Water | |
|---|---|---|---|---|
| E | 15 | 9 | 18 | 14 |
| Air | Earth | Fire | Water | |
|---|---|---|---|---|
| N | 50 | 36 | 50 | 35 |
| Air | Earth | Fire | Water | |
|---|---|---|---|---|
| F | 38 | 26 | 48 | 24 |
| Air | Earth | Fire | Water | |
|---|---|---|---|---|
| J | 15 | 12 | 19 | 15 |
Prepare data
House IE
| Gryffindor | Hufflepuff | Ravenclaw | Slytherin | |
|---|---|---|---|---|
| E | 9 | 7 | 13 | 9 |
| Gryffindor | Hufflepuff | Ravenclaw | Slytherin | |
|---|---|---|---|---|
| N | 18 | 33 | 36 | 24 |
| Gryffindor | Hufflepuff | Ravenclaw | Slytherin | |
|---|---|---|---|---|
| F | 14 | 33 | 27 | 18 |
| Gryffindor | Hufflepuff | Ravenclaw | Slytherin | |
|---|---|---|---|---|
| J | 8 | 14 | 13 | 7 |
Step one, per usual, is cleaning up the data for our specific purposes. The main changes are shortening some response names and adding a fun variable about alcohol consumption.
dat_habits <- dat %>%
select(id, age, gender, weed_use, alc_use, concerts, country, state) %>%
mutate(weed_use = case_when(
weed_use == "I used to smoke weed, but now I don't" ~ "I don't smoke weed anymore",
T ~ weed_use
),
alc_use = case_when(
alc_use == "I used to drink, but don't anymore" ~ "I don't drink anymore",
T ~ alc_use
),
`Drinks in the USA` = ifelse(country == "United States" &
alc_use != "I never had alcohol", "Yes", "No"),
underage_drinking = ifelse(`Drinks in the USA` == "Yes" & age <= 20,
"Yes",
"No"))
Let us first make some barplots of both drugs, followed by a ztable, to see what our distribution of recorded substance usage is. I will also run a chi-square (because the function already does that) to see if the distribution of usage is unexpected; that is, are weed use and alcohol use independent, or are they correlated in some manner? I am also ordering the factors to make the tables and following graphs more readable.
dat_drugs <- dat_habits %>% filter(alc_use != "", weed_use != "")
dat_drugs$alc_use <- factor(dat_drugs$alc_use, c("I never had alcohol",
"I don't drink anymore",
"Once a month/socially",
"Once a week",
"Multiple times per week",
"Every day",
"Multiple times a day"))
dat_drugs$weed_use <- factor(dat_drugs$weed_use, c("I have never smoked weed",
"I don't smoke weed anymore",
"Once a month",
"Once a week",
"Multiple times per week",
"Every day",
"Multiple times a day"))
Bringing in data about weed legality around the world and in the US states
world_weed <- read.csv("inputs/weed_world_legality.csv", stringsAsFactors = F, encoding = "UTF-8-BOM")
colnames(world_weed)[1:3] <- c("country",
"recreational",
"medical")
world_weed <- mutate(world_weed, weed_laws = case_when(
medical == "Illegal" ~ "not legal",
#case_when() keeps the first match, so check recreational before medical
#(otherwise recreational countries get labeled "medical/complicated")
recreational == "Legal" ~ "recreational",
str_detect(medical, "Legal") ~ "medical/complicated",
TRUE ~ "decriminalized/complicated"
))
state_weed <- read.csv("inputs/state_marijuana_laws_10_2016.csv", stringsAsFactors = F)
colnames(state_weed) <- c("state", "medical", "recreational", "no_laws")
state_weed <- mutate(state_weed, weed_laws = case_when(
no_laws == "Yes" ~ "not legal",
recreational == "Yes" ~ "recreational",
medical == "Yes" ~ "medical"
))
alc_count <- dat_drugs %>% count(alc_use)
weed_count <- dat_drugs %>% count(weed_use)
ggplot(alc_count, aes(x=alc_use, y = n, fill=alc_use)) +
geom_col(color="black", ) +
coord_flip() +
scale_fill_viridis_d(option = "A", direction = -1) +
theme_classic() +
theme(legend.position = "none") +
geom_text(aes(label = n), nudge_y = 3, size=4) +
labs(title = "Alcohol Usage in 666posting",
subtitle = paste("N =", sum(alc_count$n))) +
labs(fill = "Alcohol Usage") +
xlab("Alcohol Usage") +
ylab("Count of Individuals")
ggsave("outputs/alcohol_usage_barplot.png")
ggplot(weed_count, aes(x=weed_use, y = n, fill=weed_use)) +
geom_col(color="black", ) +
coord_flip() +
scale_fill_viridis_d(direction = -1) +
theme_classic() +
theme(legend.position = "none") +
geom_text(aes(label = n), nudge_y = 3, size=4) +
labs(title = "Weed Usage in 666posting",
subtitle = paste("N =", sum(weed_count$n))) +
labs(fill = "Weed Usage")+
xlab("Weed Usage") +
ylab("Count of Individuals")
ggsave("outputs/weed_usage_barplot.png")
Now let’s look at that ztable
auto_ztable(dat_drugs,
"alc_use",
"weed_use",
"Heatmap Table of Alcohol and Weed Usage",
simul_p = T)
| I have never smoked weed | I don't smoke weed anymore | Once a month | Once a week | Multiple times per week | Every day | Multiple times a day | |
|---|---|---|---|---|---|---|---|
| I never had alcohol | 5 | 2 | 0 | 0 | 1 | 0 | 1 |
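auto_ztable() is a helper defined earlier in the document; I’m assuming the chi-square it reports is the standard test of independence run on the contingency table, with a simulated p-value since many cells are sparse. A minimal base-R sketch on a toy table (made-up counts, not the survey data):

```r
# Toy 2x2 table standing in for the alcohol-by-weed cross-table
toy <- matrix(c(20, 5,
                10, 15), nrow = 2, byrow = TRUE,
              dimnames = list(alc  = c("social", "weekly"),
                              weed = c("never", "monthly")))

set.seed(1)
# simulate.p.value = TRUE gives a Monte Carlo p-value,
# which is robust when expected cell counts are small
toy_chi <- chisq.test(toy, simulate.p.value = TRUE, B = 2000)
toy_chi
```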
So it appears, according to the chi-square, that our data might be trending towards significance, but there is nothing to be concluded here. Overall it appears most people used to smoke weed but don’t anymore, and most people drink socially.
Now let us see if there are any interesting distributions with age
#Make a histogram of ages and habits
#dat_drugs$alc_use <- factor(dat_drugs$alc_use, names(sort(table(dat_drugs$alc_use), decreasing= F)))
ggplot(dat_drugs, aes(x=age, fill=alc_use)) +
geom_histogram(breaks=seq(15, 45, by=1), color="black") +
scale_fill_viridis_d(option = "A", direction = -1) +
theme_classic() +
labs(title = "Alcohol Usage Histogram in 666posting",
subtitle = paste("N =", sum(alc_count$n)),
fill = "Alcohol Usage")
ggsave("outputs/alcohol_histogram.png")
Above is an interesting way to view the distribution, but to see differences there are better plots, such as the violin plot below:
ggplot(dat_drugs, aes(y=age, x= alc_use, fill=alc_use)) +
geom_violin() +
geom_sina(color = "grey40", alpha = .7) +
stat_n_text(size = 3) +
theme_classic() +
scale_fill_viridis_d(option = "A", direction = -1) +
coord_flip() +
theme(legend.position = "none") +
labs(title = "Alcohol Usage in 666posting",
subtitle = paste("N =", sum(alc_count$n)),
fill = "Alcohol Usage")
ggsave("outputs/alcohol_boxplot.png")
These distributions don’t seem that different from one another. As a test, I’ll run a linear model to see if these drinking groups are a good predictor of age. That said, I’m not really going to check any assumptions (don’t do this). I will remove the one older individual because they are an outlier, but I don’t expect there to be a difference.
no_out_drugs <- dat_drugs %>% filter(age <= 38) %>%
mutate(center_age = age - mean(age))
no_out_drugs %>%
group_by(alc_use)%>%
get_summary_stats(age, type = "mean_sd")
## # A tibble: 7 x 5
## alc_use variable n mean sd
## <fct> <chr> <dbl> <dbl> <dbl>
## 1 I never had alcohol age 9 21.8 5.12
## 2 I don't drink anymore age 15 23.7 4.17
## 3 Once a month/socially age 75 22.3 2.66
## 4 Once a week age 42 23.5 2.79
## 5 Multiple times per week age 57 24.3 3.51
## 6 Every day age 9 24.7 1.94
## 7 Multiple times a day age 2 26.5 0.707
lm_alc <- lm(center_age ~ alc_use, data = no_out_drugs)
summary(lm_alc)
##
## Call:
## lm(formula = center_age ~ alc_use, data = no_out_drugs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.2667 -2.2667 -0.2667 2.3333 10.2667
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.5045 1.0528 -1.429 0.1545
## alc_useI don't drink anymore 1.9556 1.3317 1.468 0.1435
## alc_useOnce a month/socially 0.4889 1.1142 0.439 0.6613
## alc_useOnce a week 1.6984 1.1602 1.464 0.1448
## alc_useMultiple times per week 2.4854 1.1329 2.194 0.0294 *
## alc_useEvery day 2.8889 1.4889 1.940 0.0537 .
## alc_useMultiple times a day 4.7222 2.4691 1.913 0.0572 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.159 on 202 degrees of freedom
## Multiple R-squared: 0.08829, Adjusted R-squared: 0.06121
## F-statistic: 3.26 on 6 and 202 DF, p-value: 0.004415
Nothing convincingly significant (only “multiple times per week” dips under 0.05), but social drinking trends towards younger individuals, which is interesting.
And now I’m going to reveal all the filthy filthy lawbreakers in this group; I’m talking under-age drinking. (I’m being sarcastic, but I think it’s an interesting question.)
#Alcohol based on country
dat_drugs$alc_use <- factor(dat_drugs$alc_use, names(sort(table(dat_drugs$alc_use), decreasing= F)))
ggplot(dat_drugs, aes(x=age, fill=`Drinks in the USA`)) +
geom_histogram(breaks=seq(15, 45, by=1), color="black") +
geom_vline(xintercept = 21, linetype = "dashed", color = "red", size = 1.4) +
scale_fill_brewer(palette = "Set2") +
labs(title = "Woah, that's illegal buddies!",
subtitle = paste0("I'm reporting all ",
sum(dat_drugs$underage_drinking == "Yes"),
" of you to the police! /s")) +
theme_classic()
ggsave("outputs/underage_drinking.png")
Disgusting…
Alright on to the completely legal topic of weed usage
#Weed
#dat_drugs$weed_use <- factor(dat_drugs$weed_use, names(sort(table(dat_drugs$weed_use), decreasing= F)))
ggplot(dat_drugs, aes(x=age, fill=weed_use)) +
geom_histogram(breaks=seq(15, 45, by=1), color="black") +
scale_fill_viridis_d(direction = -1) +
labs(title = "Weed Usage in 666posting",
subtitle = paste("N =", sum(weed_count$n)),
fill = "Weed Usage") +
theme_classic()
ggsave("outputs/weed_usage_histogram.png")
Again, the better way to view the data:
ggplot(dat_drugs, aes(y=age, x= weed_use, fill=weed_use)) +
geom_violin() +
geom_sina(color = "grey40", alpha = .7) +
stat_n_text(size = 3) +
scale_fill_viridis_d(direction = -1) +
coord_flip() +
labs(title = "Weed Usage in 666posting",
subtitle = paste("N =", sum(weed_count$n)),
fill = "Weed Usage") +
xlab("Weed Usage") +
theme_classic() +
theme(legend.position = "none")
ggsave("outputs/weed_usage_boxplot.png")
Let us test if there is a significant difference between age and weed use.
no_out_drugs %>%
group_by(weed_use)%>%
get_summary_stats(age, type = "mean_sd")
## # A tibble: 7 x 5
## weed_use variable n mean sd
## <fct> <chr> <dbl> <dbl> <dbl>
## 1 I have never smoked weed age 19 22 3.40
## 2 I don't smoke weed anymore age 82 23.4 3.46
## 3 Once a month age 35 23.4 3.09
## 4 Once a week age 7 25.1 2.91
## 5 Multiple times per week age 26 22.4 2.37
## 6 Every day age 15 21.9 2.15
## 7 Multiple times a day age 25 25.0 3.35
lm_weed <- lm(center_age ~ weed_use, data = no_out_drugs)
summary(lm_weed)
##
## Call:
## lm(formula = center_age ~ weed_use, data = no_out_drugs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.3537 -2.3537 -0.4231 1.9600 9.6463
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.28230 0.72731 -1.763 0.07940 .
## weed_useI don't smoke weed anymore 1.35366 0.80719 1.677 0.09509 .
## weed_useOnce a month 1.40000 0.90341 1.550 0.12278
## weed_useOnce a week 3.14286 1.40171 2.242 0.02604 *
## weed_useMultiple times per week 0.42308 0.95684 0.442 0.65885
## weed_useEvery day -0.06667 1.09500 -0.061 0.95151
## weed_useMultiple times a day 3.04000 0.96489 3.151 0.00188 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.17 on 202 degrees of freedom
## Multiple R-squared: 0.08148, Adjusted R-squared: 0.0542
## F-statistic: 2.987 on 6 and 202 DF, p-value: 0.008121
Weed usage, interestingly, shows that older people in this group tend to consume weed more than their younger counterparts. I feel this is somewhat driven by outliers, but “once a week” weed usage, with no outliers, was also significant. Part of me wonders how much of this has to do with financial or living situation. Perhaps older individuals smoke more because they have both the money and the space (free from those who might look down on it) to do so. I’m not sure; it would’ve been pretty weird if I had asked “what is your yearly income?” and “do you live with an authority figure?”.
### Add world and US weed laws
map <- match(dat_drugs$country, world_weed$country)
dat_drugs$weed_laws <- world_weed$weed_laws[map]
map <- match(dat_drugs$state, state_weed$state)
ind <- dat_drugs$country %in% "United States"
dat_drugs$weed_laws[ind] <- (state_weed$weed_laws[map])[ind]
dat_drugs <- mutate(dat_drugs, weed_laws = case_when(
state == "District of Columbia" ~ "recreational",
state == "" ~ "medical/complicated",
TRUE ~ weed_laws
))
dat_drugs$weed_laws <- factor(dat_drugs$weed_laws,
c("not legal",
"medical/complicated",
"medical",
"decriminalized/complicated",
"recreational"))
dat_drugs <- mutate(dat_drugs, `General Weed Use` = ifelse(
weed_use == "I have never smoked weed" | weed_use == "I don't smoke weed anymore",
"Doesn't actively smoke weed",
"Actively smokes weed"
))
drug_table <- table(dat_drugs$`weed_laws`, dat_drugs$`General Weed Use`)
z <- ztable(drug_table,
caption = "Weed use compared to country/state laws",
align = "ccccccc")
z %>%
addCellColor(rows = 2, cols = 2, "lava") %>%
addCellColor(rows = 2:6, cols = 3, "indiagreen") %>%
addCellColor(rows = 3:4, cols = 2, "orange") %>%
addCellColor(rows = 5:6, cols = 2, "indiagreen")
| Actively smokes weed | Doesn't actively smoke weed | |
|---|---|---|
| not legal | 22 | 25 |
dat_habits$concerts <- factor(dat_habits$concerts, levels = unique(dat_habits$concerts)[c(6,4,2,3,5,1)])
ggplot(dat_habits, aes(x=concerts, fill = concerts)) +
geom_bar(color = "black") + coord_flip() +
labs(title = "How often do you go to concerts?",
subtitle = paste("N = ", nrow(dat_habits))) +
scale_fill_brewer(palette = "Dark2") +
theme_classic() +
geom_text(stat='count', aes(label=..count..), position = position_dodge(width = 1), hjust = -.08) +
theme(legend.position = "none")
## Preparing the data
alex_g_conc_dat <- dat %>% select(Timestamp, alex_g_concerts, state, living_area, concerts) %>%
filter(!is.na(alex_g_concerts)) %>%
mutate(`Lives Near Philly?` = ifelse(state %in% c("Pennsylvania",
"New Jersey",
"Delaware"), "Yes", "No"))
alex_g_conc_dat$concerts <- factor(alex_g_conc_dat$concerts, levels = unique(alex_g_conc_dat$concerts)[c(6,4,2,3,5,1)])
Making a bar plot showing the general results. As you can see there are some outliers, but I have theories on why these outliers exist.
ggplot(alex_g_conc_dat, aes(x = alex_g_concerts)) +
geom_bar(color="black") +
theme_classic()
My theory is they live near Philly, and by using the state data I’ve collected I can determine that two of the largest outliers indeed do live near Philly (defined by living in Pennsylvania, New Jersey or Delaware). That said, the individual who has seen Alex 40 times lives in Missouri.
Fun fact: According to setlist.fm Alex has played 284 concerts
ggplot(alex_g_conc_dat, aes(x = alex_g_concerts, fill = `Lives Near Philly?`)) +
geom_bar(color="black") +
scale_fill_brewer(palette = "Dark2") +
annotate(geom="text", x=c(37, 17), y=5, label=c("Lives in Missouri", "Lives in South Carolina"),
color="Black") +
labs(
title = "Does seeing Alex G concerts correlate with
living near Philly?",
subtitle = paste0("N = ", nrow(alex_g_conc_dat))
) +
xlab("How many Alex G concerts have you seen?") +
ylab("Count") +
theme_classic()
ggsave(filename = "outputs/alex_g_conc_philly.png", width = hght, height = wdth)
But another theory I had is that living near a city in general might increase the probability of an individual going to more concerts.
alex_g_conc_dat$living_area <- factor(alex_g_conc_dat$living_area,
names(sort(table(alex_g_conc_dat$living_area),
decreasing= F)))
ggplot(alex_g_conc_dat, aes(x = alex_g_concerts, fill = living_area)) +
geom_bar(color="black") +
labs(fill = "Area of Living") +
theme_classic()
ggplot(alex_g_conc_dat, aes(y = alex_g_concerts, x = living_area, fill = living_area)) +
geom_violin() +
geom_sina() +
coord_flip() +
stat_n_text(size = 3) +
theme_classic() +
theme(legend.position = "none")
no_out_conc <- alex_g_conc_dat %>% filter(alex_g_concerts < 10)
ggplot(no_out_conc, aes(y = alex_g_concerts, x = living_area, fill = living_area)) +
geom_violin() +
geom_sina() +
coord_flip() +
scale_fill_brewer(palette = "Set1") +
stat_n_text(size = 3) +
theme_classic() +
labs(fill = "Area of Living") +
theme(legend.position = "none") +
labs(
title = "Does seeing Alex G concerts depend on
how close you are to a city?",
subtitle = paste0("N = ", nrow(no_out_conc))
) +
xlab("Living area") +
ylab("Number of Alex G Concerts")
ggsave(filename = "outputs/alex_g_where_live.png", width = hght, height = wdth)
ggplot(alex_g_conc_dat, aes(y = alex_g_concerts, x = concerts, fill = concerts)) +
geom_violin() +
geom_sina() +
coord_flip() +
stat_n_text(size = 3)+
theme_classic() +
theme(legend.position = "none")
ggplot(no_out_conc, aes(y = alex_g_concerts, x = concerts, fill = concerts)) +
geom_violin() +
geom_sina() +
coord_flip() +
scale_fill_brewer(palette = "Dark2") +
theme(legend.position = "none") +
stat_n_text(size = 3) +
labs(
title = "How many Alex G concerts have you seen
depending on your concert habits",
subtitle = paste0("N = ", nrow(no_out_conc))
) +
xlab("Concert Habits") +
ylab("Number of Alex G Concerts") +
theme_classic() +
theme(legend.position = "none")
ggsave(filename = "outputs/alex_g_conc_hab.png", width = hght, height = wdth)
That “once a year if ever” dude with 40 Alex G concerts must be 40 then… something is telling me that might’ve been a mistake.
I decided to use some linear models to check if any of these variables were good predictors.
conc_lm <- lm(formula = alex_g_concerts ~ living_area + `Lives Near Philly?` + concerts,
data = alex_g_conc_dat)
sum_conc_lm <- summary(conc_lm)
sum_conc_lm$coefficients
## Estimate Std. Error t value
## (Intercept) 1.744968e-13 3.740517 4.665046e-14
## living_areaIn the outskirts of a city -1.999617e-01 1.313950 -1.521837e-01
## living_areaIn the suburbs 1.120113e-01 1.256890 8.911790e-02
## living_areaIn a city 1.381653e+00 1.198747 1.152581e+00
## `Lives Near Philly?`Yes 3.707999e+00 1.062184 3.490920e+00
## concertsOnce a year if ever 2.093536e+00 3.979742 5.260482e-01
## concertsEvery 3-4 months 9.088468e-01 3.916566 2.320520e-01
## concertsEvery month or every other month 1.736156e+00 3.951054 4.394159e-01
## concertsEvery two weeks 6.820543e-02 4.059121 1.680301e-02
## concertsWhenever I can 1.572409e+00 3.932301 3.998699e-01
## Pr(>|t|)
## (Intercept) 1.0000000000
## living_areaIn the outskirts of a city 0.8791954668
## living_areaIn the suburbs 0.9290773914
## living_areaIn a city 0.2504583279
## `Lives Near Philly?`Yes 0.0005922191
## concertsOnce a year if ever 0.5994375140
## concertsEvery 3-4 months 0.8167349029
## concertsEvery month or every other month 0.6608346103
## concertsEvery two weeks 0.9866105225
## concertsWhenever I can 0.6896790533
It seems that the best predictor in this case is living near Philly! But that’s not an honest analysis, because this result is likely driven by outliers, so we’re going to toss them.
## Without the outliers
conc_lm <- lm(formula = alex_g_concerts ~ living_area + `Lives Near Philly?` + concerts,
data = no_out_conc)
sum_conc_lm <- summary(conc_lm)
sum_conc_lm$coefficients
## Estimate Std. Error t value
## (Intercept) -1.407823e-15 1.7205786 -8.182264e-16
## living_areaIn the outskirts of a city 3.009316e-01 0.6048521 4.975292e-01
## living_areaIn the suburbs 3.222993e-01 0.5786990 5.569377e-01
## living_areaIn a city 8.150397e-01 0.5520281 1.476446e+00
## `Lives Near Philly?`Yes 1.220924e+00 0.5274623 2.314714e+00
## concertsOnce a year if ever 5.119795e-01 1.8319731 2.794689e-01
## concertsEvery 3-4 months 1.265932e+00 1.8016165 7.026646e-01
## concertsEvery month or every other month 1.698775e+00 1.8178097 9.345178e-01
## concertsEvery two weeks 8.652422e-01 1.8676406 4.632809e-01
## concertsWhenever I can 1.397884e+00 1.8090075 7.727353e-01
## Pr(>|t|)
## (Intercept) 1.00000000
## living_areaIn the outskirts of a city 0.61937358
## living_areaIn the suburbs 0.57820542
## living_areaIn a city 0.14142904
## `Lives Near Philly?`Yes 0.02166441
## concertsOnce a year if ever 0.78017970
## concertsEvery 3-4 months 0.48309847
## concertsEvery month or every other month 0.35118676
## concertsEvery two weeks 0.64367706
## concertsWhenever I can 0.44060990
So it still appears that the biggest determinant of seeing Alex G concerts is whether you live near Philly, which makes sense: before Alex was touring widely, most of his shows were in that area (and people could’ve counted Skin Cells concerts as Alex G concerts). That said, once you remove the outliers from the dataset, living near Philly is no longer as strong a predictor of how many Alex G concerts you’ve seen, but it is still the biggest one.
ggplot(no_out_conc, aes(y = alex_g_concerts, x = `Lives Near Philly?`, fill = `Lives Near Philly?`)) +
geom_violin() +
geom_jitter(width = .25, height = 0.2) +
coord_flip() +
scale_fill_brewer(palette = "Dark2") +
theme_classic() +
theme(legend.position = "none") +
stat_n_text(size = 3) +
labs(
title = "How many Alex G concerts have you seen
based on living near Philly",
subtitle = paste0("N = ", nrow(no_out_conc))
) +
xlab("Live near Philly?") +
ylab("Number of Alex G Concerts")
I didn’t know where this section fit best, so I just put it here. I didn’t really have any hypotheses here either, but I hope you enjoy the data!
bovie_dat <- dat %>% select(age, movie_genre, book_genre) %>%
filter(movie_genre != "", book_genre != "")
movie_plot <- ggplot(bovie_dat, aes(movie_genre)) +
geom_bar() +
coord_flip() +
xlab("Movie Genres") +
theme_classic() +
geom_text(stat='count', aes(label=..count..), position = position_dodge(width = 1), hjust = -.08) +
labs(title = "Favorite Movie Genres",
subtitle = paste0("N = ",nrow(bovie_dat)))
movie_plot
book_plot <- ggplot(bovie_dat, aes(book_genre)) +
geom_bar() +
coord_flip() +
xlab("Book Genres") +
theme_classic() +
geom_text(stat='count', aes(label=..count..), position = position_dodge(width = 1), hjust = -.08) +
labs(title = "Favorite Book Genres",
subtitle = paste0("N = ",nrow(bovie_dat)))
book_plot
auto_ztable(bovie_dat,
"movie_genre",
"book_genre",
"Heatmap Table of book and movie genres",
simul_p = T)
| Crime | Drama | Fantasy | Historical Fiction | Horror | I don't read books | Mystery | Nonfiction | Poetry | Romance | Satire | Sci-Fi | Suspense/Thriller | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Action | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
Another question we asked is what type of college major people had, or whether they even decided to attend college, which we look at below.
college_dat <- dat %>% mutate(id = row_number()) %>%
select(id, col_major)
college_dat_melt <- college_dat %>%
mutate(col_major = (str_split(col_major, ","))) %>%
unnest(cols = col_major) %>%
mutate(col_major =(str_trim(col_major, "both"))) %>%
filter(!(col_major %in% c("Used to","but not anymore", "", "Nope")))
#General stats
sum_major <- as.data.frame(table(college_dat_melt$col_major),
stringsAsFactors = F) %>%
mutate(Var1 = str_to_title(Var1))
#Check data looks okay
#unique(college_dat_melt$col_major)
#Make a dataframe giving each person a 1 or 0 based on whether they studied the major, 1 is TRUE and 0 is FALSE
college_df <- (as.data.frame.matrix(table(college_dat_melt$id, college_dat_melt$col_major)))
#Make major strings look like titles
colnames(college_df) <- str_to_title(colnames(college_df))
degree = 1
degree_list <- list()
#Probably should change this to a lapply, but we use this loop to generate a list of ids that play a given instrument
for (degree in 1:ncol(college_df)) {
degree_list[[colnames(college_df)[degree]]] <- (1:nrow(college_df))[college_df[,degree] == 1]
}
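For the curious, the lapply version mentioned in the comment above would look something like this, shown here on a toy 0/1 indicator data frame rather than the real college_df:

```r
# Toy indicator data frame: rows are people, columns are majors
toy_df <- data.frame(Biology = c(1, 0, 1),
                     English = c(0, 1, 1),
                     Math    = c(0, 0, 1))

# lapply over a data frame iterates its columns and keeps their names,
# so this collects, per major, the row ids of people flagged 1
toy_list <- lapply(toy_df, function(col) which(col == 1))

toy_list$English  # rows 2 and 3
```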
Upset plot for different majors to see what double majors we have
plot_degree <- upset(fromList(degree_list),
order.by = "freq",
nsets =20,
#point.size = 4,
#line.size = 1.8,
text.scale=c(2.8, 1.6, 1.8, 1.6, 1.3, 2),
mainbar.y.label = "Major Intersections",
sets.x.label = "People Per Major"
)
plot_degree
two_degrees <- sum(rowSums(college_df) == 2)
three_degrees <- sum(rowSums(college_df) == 3)
four_degrees <- sum(rowSums(college_df) == 4)
We have 23 people with two majors, 4 with three, and 1 with four.
library(lubridate)
# x = dat$country
# rank = 1
# to_low = T
ranked_top <- function(x, rank = 1, to_low = T){
temp <- names(sort(table(x),decreasing=TRUE)[rank])
if (to_low == T) {
temp <- tolower(temp)
}
return(temp)
}
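A quick sanity check of ranked_top() on toy data; the function is repeated here so the snippet runs on its own:

```r
# ranked_top() from above, redefined so this snippet is self-contained
ranked_top <- function(x, rank = 1, to_low = T){
  temp <- names(sort(table(x), decreasing = TRUE)[rank])
  if (to_low == T) {
    temp <- tolower(temp)
  }
  return(temp)
}

pets <- c("Cat", "Dog", "Dog", "Dog", "Cat", "Bird")
ranked_top(pets)                            # "dog" (most common, lower-cased)
ranked_top(pets, rank = 2, to_low = FALSE)  # "Cat" (second most common)
```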
#When the average person filled out the survey
#mean(mdy_hm(dat$Timestamp))
#Average age
avg_age <- median(dat$age)
#Country
avg_country <- ranked_top(dat$country,to_low = F)
#State
avg_state <- ranked_top(dat$state, to_low = F, rank = 3)
#Race
avg_race <- ranked_top(dat$race)
#They are male
avg_gender <- ranked_top(dat$gender)
#They are cis
avg_identity <- ranked_top(dat$transgender)
#They are straight
avg_orientation <- ranked_top(dat$orientation)
#They are monogomous
avg_poly <- ranked_top(dat$polyamorous)
#They started listening to alex G at the age of 19
avg_start_listen <- median(dat$first_age)
#They've been to 2 Alex G concerts
avg_alex_conc <- median(dat$alex_g_concerts, na.rm = T)
# They found Alex G through a friend
avg_discovery <- ranked_top(dat$discover_method)
#Their favorite song is Snot
avg_fave_song <- ranked_top(dat$favorite_song, to_low = FALSE)
#Their favorite album is Trick
avg_fave_album <- ranked_top(dat$favorite_album, to_low = FALSE)
#They don't listen to Alex G's unreleased albums
avg_unreleased <- ranked_top(dat$unreleased_album)
#They like the gretel music video
avg_music_vid <- ranked_top(dat$music_vid, to_low = FALSE)
#They listen to music via spotify
avg_music_source <- ranked_top(dat$music_method)
#They probably play guitar
avg_instrument <- (sum_instr %>% arrange(desc(Freq)))[1,1]
#They are not in a band
avg_band <- ranked_top(dat$in_band)
#Favorite musician
avg_musician <- ranked_top(musician_dat$favorite_musician, to_low = FALSE)
#Second fave musician
second_musician <- ranked_top(musician_dat$sec_musician, rank = 1, to_low = FALSE)
#Non Alex musician
second_2_musician <- ranked_top(musician_dat$sec_musician, rank = 2, to_low = FALSE)
####NEED TO DO ART
avg_art <- (sum_art %>% arrange(desc(Freq)))[1,1]
#Major
avg_major <- (sum_major %>% arrange(desc(Freq)))[1,1]
##How often they go to concerts
avg_concert_go <- ranked_top(dat$concerts)
# Weed
avg_weed <- ranked_top(dat$weed_use)
#Alcohol
avg_alcohol <- ranked_top(dat$alc_use)
#Living area
avg_living <- ranked_top(dat$living_area)
#authoritarian_lib
med_lib_auth <- median(dat$lib_auth, na.rm = T)
#Left right
med_left_right <- median(dat$left_right, na.rm = T)
#MBTI
med_meyers <- ranked_top(dat$mbti, to_low = F)
#Harry potter
med_harry <- ranked_top(dat$hp_house)
#Astro-sign
med_sign <- ranked_top(dat$sign, to_low = F)
So around 2020-05-08 09:57:30, a white, heterosexual/straight male who is 23 years old decided to fill out the survey. They live in the United States, specifically in California, and when asked if they prefer multiple romantic partners they generally say no. Their favorite musician is Alex G; when you ask them their second favorite musician they’ll emphasize how much they like Alex G before balking and saying Modest Mouse. They have a lot of opinions about Alex G: their favorite song is Snot, their favorite music video is Gretel, and their favorite album is Trick. That said, when asked about their favorite fan compilation they reply, “i don’t listen to them.”, which is probably because they listen to most of their music through Spotify, where it is somewhat difficult to find non-official tracks. Their favorite song is definitely Snot.
Nevertheless, they’re still a big Alex G fan; they started listening to him when they were 19 years old and have been to 2 concerts so far. When asked what they like to do in their spare time, they have a couple of hobbies. They like to play the Guitar, but when asked if they’re in a band, they’d say no. They also use writing to express themselves, and when it comes to concerts they go “whenever i can”. They are studying/have studied the Stem Field and live in a city. When it comes to weed they would say “i used to smoke weed, but now i don’t”, and with alcohol they’d say they drink “once a month/socially”.
When it comes to their political leanings, on scales of 1 to 10 from left to right politics, and from libertarian to authoritarian, they fall around 2 and 3 respectively. As for their personality, if you arbitrarily asked them their MBTI, astrological sign, and Harry Potter House, they would tell you, respectively, INFP, Gemini, and “i don’t know”. They find all of these questions you’re asking them kinda weird, almost like this format I decided to use to explain the most average Alex G fan didn’t pan out the way I wanted, but they know everyone is just trying their best.
In the order Facebook decided:
Special shoutout to Isabelle for allowing and promoting the survey.
Graham aka Grumpus for talking to me about Alex G back when I knew diddly squat besides Mary. Thank you for introducing me to one of my favorite artists of all time.